AD-A013  583 

A NEW TIME-DOMAIN ANALYSIS OF HUMAN SPEECH AND OTHER
COMPLEX WAVEFORMS

Janet MacIver Baker

Carnegie-Mellon University


Prepared for:

Air  Force  Office  of  Scientific  Research 
Defense  Advanced  Research  Projects  Agency 


May  1975 


DISTRIBUTED  BY: 


National  Technical  Information  Service 
U.  S.  DEPARTMENT  OF  COMMERCE 




UNCLASSIFIED 


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


REPORT  DOCUMENTATION  PAGE 

READ INSTRUCTIONS BEFORE COMPLETING FORM

1. REPORT NUMBER: (illegible in source)
2. GOVT ACCESSION NO.:
3. RECIPIENT'S CATALOG NUMBER:
4. TITLE: A NEW TIME-DOMAIN ANALYSIS OF HUMAN SPEECH AND OTHER COMPLEX WAVEFORMS
5. TYPE OF REPORT & PERIOD COVERED: Interim
6. PERFORMING ORG. REPORT NUMBER:
7. AUTHOR(S): Janet MacIver Baker
8. CONTRACT OR GRANT NUMBER(S): F44620-73-C-0074
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Carnegie-Mellon University, Computer Science Dept., Pittsburgh, PA 15213
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS: 61101D; AO-2466 1
11. CONTROLLING OFFICE NAME AND ADDRESS: Defense Advanced Research Projects Agency, 1400 Wilson Blvd, Arlington, VA 22209
12. REPORT DATE: May 1975
13. NUMBER OF PAGES: (illegible in source)
14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office): Air Force Office of Scientific Research/NM, 1400 Wilson Blvd, Arlington, VA 22209
15. SECURITY CLASS (of this report): UNCLASSIFIED
15a. DECLASSIFICATION/DOWNGRADING SCHEDULE:
16. DISTRIBUTION STATEMENT (of this Report): Approved for public release; distribution unlimited.
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report):
18. SUPPLEMENTARY NOTES:
19. KEY WORDS (Continue on reverse side if necessary and identify by block number):
20. ABSTRACT (Continue on reverse side if necessary and identify by block number):

The purpose of this research is to explore the usefulness of a new time-domain analysis of
complex waveforms, especially with respect to human speech. A chief advantage of time-domain
analysis is its precise temporal resolution, which is particularly useful for characterizing very short
duration events or regions of rapid change. A significant portion of the speech waveform consists
of such regions, especially at phone boundaries. In addition, it is well-known that the intelligibility


of human speech is, in large part, conveyed by acoustic transitional states, generally encompassed
by the consonantal elements (cf. written Hebrew, where specific vowel designations are omitted).

In contrast to classical frequency-domain analysis, this time-domain analysis has no inherent
bandwidth limitation; that is, there is no trade-off between time and frequency resolution. A
major advantage is gained thereby, for certain purposes. On the other hand, acoustic steady states,
such as those usually caused by the vowels in human speech, are best characterized in the
frequency-domain. Therefore, despite redundancy, much of the information in these two domains
is complementary in nature.

The motivation underlying the time-domain studies presented here arises from compelling
evidence, both perceptual and physiological. The perceptual evidence relates to the high
intelligibility of infinitely peak-clipped speech, while the physiological evidence relates to the coding
operation known as "phase-locking", performed by a large number of first-order auditory neurons.
Our signal analyses are directly analogous to these two forms of information extraction. We derive
our parameters from the individual waveform cycles, which are defined as occurring between
successive up-crossings of the waveform across a zero-axis. An important distinction between this
analysis and that of most other zero and up-crossing analyses is that the cycle measures are not
uniformly averaged together; therefore we preserve the precise temporal resolution. In addition,
we have developed a visual display, the "Log Inverse Period" or LIP plot, which provides a very
useful representation of much of this cycle-based information.

Essentially three separate investigations are presented, with the last two predicated on the
results of the first.

1) Cycle-based time-domain parameters were extracted from the speech waveforms of many
hundreds of utterances, and were then subjected to extensive scrutiny, both by hand and by
machine. In addition to investigating a time-domain characterization of speech waveforms in
general, we found a number of new acoustic phenomena of speech. Many of these are of short
duration and/or low amplitude, and are frequently found at phone or sub-phone boundaries. The
existence of such events contributes to a more phone-discrete view of continuous speech than is
generally held.

2) Based solely on time-domain phenomena found in the previous study, we wrote an
automatic segmentation program for continuous speech. Given the same data set and compared
against other segmentation programs available in the speech community, the time-domain
segmentation (without speaker training) compared favorably in all respects, and superiorly in
certain respects.

3) We examined the time-domain acoustic characteristics of 228 allophones of fricatives and
stop consonants, for each of three speakers (2 males, 1 female). We determined consistent
acoustic differences between these, and demonstrated that given this understanding, good
discrimination in pair-wise phone comparisons is possible, both among the stop consonants and
among the fricatives, even without contextual information.

Finally, we present a personal view of the synergism inherent in the utilization of these
time-domain techniques with the traditional frequency-domain techniques. In addition,
suggestions are presented for applying these generalizable time-domain techniques to other complex
waveforms, especially amenable to such analysis. Specific examples are drawn from music (e.g.
violin) and animal (e.g. bou-bou shrike) vocalizations.


A NEW TIME-DOMAIN ANALYSIS OF HUMAN SPEECH
AND OTHER COMPLEX WAVEFORMS


Janet MacIver Baker
May 1975


Submitted in partial fulfillment of the requirements for the
degree of Doctor of Philosophy in Biocommunication and
Computer Science

Mellon Institute of Science
Carnegie-Mellon University
Pittsburgh, Pennsylvania


This work was supported by the Defense Advanced Research Projects
Agency under contract F44620-73-C-0074.


Acknowledgements


I especially wish to express my sincerest thanks and gratitude to R. Reddy, for generously
providing the support and research freedom necessary to pursue these studies, to Allen Newell,
who is largely responsible for the excellent speech research milieu at Carnegie-Mellon University,
to J. C. R. Licklider, whose infinite peak-clipping experiments demonstrated the significance of
time-domain information, to my long-time mentor, Jerome Y. Lettvin, whose spirit and CLOOGE
provided the spark, and to my most generous and understanding colleague, James K. Baker, with
whom anything is possible! This research was supported in part by the Advanced Research
Projects Agency of the Department of Defense under contract no. F44620-73-C-0074 and
monitored by the Air Force Office of Scientific Research. Final editing and revisions of the
dissertation were done while the author was with the Speech Processing Group, Computer Science
Department, IBM Thomas J. Watson Research Center, Yorktown Heights, New York.




TABLE OF CONTENTS

The Preface                                                  1
I.   The Time-Domain and Speech                              4
II.  Characteristics of Speech in the Time-Domain           13
III. Segmentation of Speech                                 64
IV.  Allophones of Fricatives and Stop Consonants           83
V.   Violins, Bou-Bou Shrikes, and Other Applications       97

Appendices:
A.   Sentence Spectrograms and LIP Plots                   115
B.   Sentence Segmentation Results                         137
C.   Speaker Parameters for Phone Recognition Study        142

Bibliography                                               149



PREFACE


The central aim of this research investigation is to examine what kinds of information some
new techniques of time-domain analysis reveal about the complex waveforms of speech. A chief
advantage of time-domain analysis, in general, is its precise temporal resolution, which is
particularly useful for studying regions of rapid change as well as very short duration events. A
significant portion of the speech waveform consists of such regions, especially at phone boundaries.

The motivation underlying the time-domain studies presented here arises from compelling
evidence, both perceptual and physiological. The perceptual evidence relates to the high
intelligibility of infinitely peak-clipped speech, while the physiological evidence relates to the coding
operation known as "phase-locking", performed by a large number of the first-order auditory
neurons. Infinite peak-clipping rectifies the waveform (optionally with or without preservation of
peak and valley amplitude information), thereby preserving only the times of waveform
up-crossings and down-crossings relative to a horizontal zero axis drawn through the waveform, along
with the appropriate polarity designation. That is, as the waveform crosses the axis, the
information retained consists of the time of the crossing and its direction, up or down. In contrast, given
a signal waveform, a phase-locking neuron fires once, phase-consistently, for each cycle or integer
multiple number of cycles within the waveform. Essentially, the operation of phase-locking is
equivalent to infinite peak-clipping which responds only to either waveform up-crossings or
down-crossings, but not both (the information contained in a set of crossings of either polarity is
highly redundant with that contained in the set of opposite polarity). Our signal analyses are
directly analogous to these two forms of information extraction. We derive our parameters from
the individual cycles in a waveform. A cycle is defined as that portion of waveform occurring
between two consecutive up-crossings of the waveform across a zero-axis drawn through its
center. The inverse of the cycle period (the duration of time between successive up-crossings) is
referred to as "cycle-frequency"; the other parameters used are also extracted from individual
cycles of the waveform. In the visual display we have developed, the logarithm of the fundamental
inverse period or cycle-frequency measure is plotted on the y-axis against time on the x-axis. This
display is named the "Log Inverse Period" or "LIP" plot.
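
As a rough illustration of the LIP plot just described, the following Python sketch converts a list of up-crossing times into (time, log cycle-frequency) points. The function name `lip_points`, the use of base-10 logarithms, and the choice to place each point at the start of its cycle are assumptions made for illustration; the text does not specify them.

```python
import math

def lip_points(upcrossing_times):
    """Return (time, log10(cycle-frequency)) pairs, one per cycle.

    upcrossing_times: increasing up-crossing times in seconds.
    Each cycle spans two consecutive up-crossings; its cycle-frequency
    is the inverse of its period.
    """
    points = []
    for start, end in zip(upcrossing_times, upcrossing_times[1:]):
        period = end - start
        if period <= 0:
            raise ValueError("up-crossing times must be strictly increasing")
        cycle_frequency = 1.0 / period          # Hz
        # Assumption: base-10 log, point plotted at the start of the cycle.
        points.append((start, math.log10(cycle_frequency)))
    return points

# Two 10 ms cycles (about 100 Hz) followed by one 1 ms cycle (about 1000 Hz).
example = lip_points([0.000, 0.010, 0.020, 0.021])
```

Because every cycle contributes its own point, a single short cycle appears as an isolated jump on the y-axis rather than being smoothed away.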

In the exploration of our time-domain methods, we have conducted both basic and applied
research. Presented here are three separate investigations; the last two are predicated on the
results of the first.

1) Cycle-based time-domain parameters were extracted from speech waveforms, and then
subjected to extensive scrutiny, both by hand and machine. Many hundreds of utterances were
analyzed. These included sizable numbers each of nonsense syllables, citation form phrases, and
complete sentences of connected speech, with all forms of speech spoken by both males and
females. These studies have enabled us to describe in the time-domain many of the acoustic
characteristics of speech well-known from traditional frequency-domain studies, as well as to
discover many new acoustic phenomena of speech. Some of these phenomena comprise very short
duration events, often low in amplitude, which occur at phone boundaries or even sub-phone
boundaries where acoustic states change rapidly. The existence of such events contributes to a
more phone-discrete view of continuous speech than is generally held.

2) Based solely on the time-domain phenomena found and investigated in the previous study,
we wrote a program for automatic segmentation of continuous speech. This program was run on a
set of utterances submitted to the Segmentation and Labeling Workshop held at Carnegie-Mellon
University (July, 1973). The results were then compared with those presented by other groups of
the speech community. This evaluation demonstrated that the time-domain based segmentation
(without speaker training) compared favorably in all respects, and superiorly in certain respects, to
the segmentation results of the other programs.

3) We studied differences in the time-domain acoustic characteristics among a set of 228
allophones of fricatives and stop consonants, spoken by each of three speakers (2 male, 1 female).
The aim of this study was to examine whether different commonly occurring contexts of a given phone
cause changes in the acoustic manifestations of that phone and, if such changes do occur, whether they
are regular. For example, is a /p/ in a retroflexed environment acoustically different from a /p/ in a
nasalized environment? The answer is "yes". What's more, the effect of retroflexion on a /p/ is
similar to that on a /b/, /d/, /t/, /g/, and /k/! We performed stringent pair-wise phone recognition
tests which were based on the computed allophone effects derived from single instances of
phones not included in the test set. For example, we tried to distinguish /k/ allophones from /t/
allophones, based on the allophone statistics collected from the combined set of allophones from
/b/, /p/, /g/, and /d/ by the same speaker. Significant statistical results for phone recognition
were obtained, thereby supporting the concept of regular acoustic allophone transformations for
commonly occurring fricatives and stop consonants of general English.

Finally, we present a personal view of the synergism inherent in the utilization of these
time-domain techniques with the traditional frequency-domain techniques. In addition,
suggestions are presented for applying these generalizable time-domain methods to other complex
waveforms, especially amenable to such analysis. Specific examples are drawn from music and
animal vocalizations.


Chapter I - THE TIME-DOMAIN AND SPEECH


INTRODUCTION

The acoustics of human speech and of other complex animal vocalizations has long been an
area of primary interest to a number of researchers. Basic findings in this area are directly relevant
for linguistic studies, speech aids for the handicapped, child speech development, cross-cultural
language speech comparisons, speech compression, automatic speech recognition, physiological
processes, perception, and so forth.

THE  NEED  FOR  TIME-DOMAIN  INFORMATION 

Over the years, tremendous amounts of time, energy, and expense have been devoted to
studying the acoustics of speech, both for basic research and for applications such as those listed
above. These investigations have chiefly centered around analyses of speech waveforms, both
analog and digital. These speech waveforms are usually derived from the voltage-time relations
directly proportional to changes in air pressure caused by speech. A speech waveform plot,
referred to as an "oscillogram", contains complete information about the original signal from
which it is derived. An oscillogram is an example of a time-domain display because time is
determined for any given event or feature occurring in the signal; for example, the time of
maximum amplitude occurrence during a pitch period. Usually the speech waveform itself exhibits
a great deal of variability in a continuous fashion. In general, except for stressed vowels and
central portions of slowly spoken phones, a given acoustic state soon transitions into a different
one. Unfortunately, past studies of time-domain methods have not revealed good ways of reducing
this large bulk of highly variable data and deriving from it robust and useful parameters for speech
analysis. Therefore time-domain analyses of the waveform have generally been relegated to a very
minor role in speech studies, the major role being played by frequency-domain analyses.

Frequency-domain analyses, largely dominated by spectrographic studies, average waveform
frequency components over some fixed interval of time in order to derive spectral measures of the
signal. This averaging means that in the frequency-domain, there exists an inherent bandwidth
limitation such that the better the frequency resolution, the worse the time resolution, and vice
versa. For steady state periodic or quasi-periodic signals, frequency-domain analyses reveal good
information on the waveform component frequencies. In human speech, chiefly vowels satisfy this
criterion and are therefore most amenable to this analysis. Vowels, especially stressed vowels, are
characterized by waveforms which are relatively steady state with pitch periods consisting of
several cycles each. Frequency-domain methods, with speaker training, have proven quite
successful in reliably characterizing vowels, especially with respect to formant structures.
However, many other phones with non-periodic waveforms, often undergoing rapid frequency and
amplitude modulations, are much more poorly described in the frequency-domain. Here the
bandwidth limitation becomes a serious hindrance because the time during which an acoustic
feature of a consonant or phone change occurs is often too short for good frequency
discriminations. The implications of frequency-averaging in such cases must be carefully considered.

When classical frequency-domain methods are applied to a signal composed of rapidly
changing frequency characteristics, the result is an average of its component frequencies, which
does not bear a unique relationship to the original signal. Clearly, averaging over such a period of
changing frequency does not yield a good representation of the original signal. It is also true that
the magnitude spectrum of a given waveform over the interval of averaging is the same whether
the waveform is played frontwards, backwards, or otherwise transformed in various ways.
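
The time-reversal point can be checked numerically. The sketch below is an illustration only: it uses a plain discrete Fourier transform written with the Python standard library, and the rising-frequency test signal and all names are hypothetical choices, not taken from the text.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Magnitudes of the discrete Fourier transform of a real sequence."""
    n = len(samples)
    spectrum = []
    for k in range(n):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        spectrum.append(abs(total))
    return spectrum

# A short rising-frequency test signal (hypothetical, for illustration only).
n = 64
signal = [math.sin(2 * math.pi * (2 + 6 * t / n) * t / n) for t in range(n)]

forward = magnitude_spectrum(signal)
backward = magnitude_spectrum(signal[::-1])   # same waveform played in reverse

# Largest difference between the two magnitude spectra (rounding error only).
max_difference = max(abs(a - b) for a, b in zip(forward, backward))
```

The two waveforms differ sample by sample (one rises in frequency, the other falls), yet their magnitude spectra over the analysis interval agree to within rounding error; the ordering of events in time is carried only by the phase.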

The bandwidth limitations of frequency-domain methods are not serious providing these
hypothetical fast-changing signals do not occur or occur infrequently. However, even simple visual
inspection of speech waveforms reveals an abundance of such situations.

These situations frequently occur at phonetic boundaries and during the course of stops and
fricatives, all regions characterized by acoustic transients. Therefore, due to the acoustic structure
of the waveform itself, some of these features are only observable with precise temporal resolution
of the signal. Time-domain techniques capable of such resolution are therefore required.

In addition, it is well-known that the intelligibility of human speech is, in large part, conveyed
by acoustic transitional states, generally encompassed by the consonantal elements (cf. written
Hebrew, where specific vowel designations are omitted). The vowels, acoustically relatively steady
state, although well described and discriminated on the basis of spectrographic features, assume a
lesser role in speech intelligibility. Nonetheless, both the acoustic features observed in the
time-domain and those observed in the frequency-domain are important for speech characterization.
That is, analyses in both domains are complementary. In addition, due to the well-known
redundancy in speech, analysis in the time-domain yields information found by frequency-domain
studies, and vice versa. That is, time-domain analyses reveal some information about the steady
state vowels, just as frequency-domain analyses reveal some information about the transitional
state consonants.

HOW TO FIND TIME-DOMAIN INFORMATION

Having made a case for the necessity of characterizing acoustic transients in the time-domain,
we therefore propose and demonstrate a means for doing so. We have extensively investigated a
kind of time-domain analysis which is characterized by parameterizations containing much of the
waveform information necessary for the intelligibility of speech. These parameters are computed
for each individual cycle in the speech waveform. A cycle is designated as that portion of
waveform occurring between successive up-crossings of the waveform across a zero-axis drawn
horizontally through the center of the waveform. Three kinds of cycle measures have proven very
useful: measures of the period or cycle duration, amplitude, and cycle microstructure. Of these,
the most interesting is the cycle period which measures the duration of a cycle between successive
up-crossings. Hereafter, the inverse or reciprocal of this period is referred to as
"cycle-frequency". The choice of this particular parameter is motivated by the compelling evidence
derived from perceptual studies on infinitely peak-clipped speech and on the neurophysiological
studies of phase-locking in auditory information processing.

The perceptual studies were conducted about 25 years ago by Licklider and his colleagues
[L1,L2], who demonstrated the high intelligibility of infinitely peak-clipped speech. Infinite
peak-clipping reduces the complex waveform to a rectangular waveform where all cycles are equal
in amplitude. The only information retained by this transformation is the time and direction of
waveform zero-crossings. These experiments clearly showed that a great deal of information must
be encoded in the temporal pattern of zero-crossings of the speech waveform alone! Since the
time-domain parameters which completely characterize this information are also easy to obtain
automatically, we have thoroughly examined these and other related time-domain parameters in
order to discover what they may reveal about speech. Since speech is very redundant, however,
this information may well be encoded in other acoustic features of the waveform as well.
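
A minimal Python sketch of infinite peak-clipping on a sampled waveform follows. It is an illustration under stated assumptions, not a reconstruction of the original experiments: the function names are hypothetical, samples exactly equal to zero are treated as positive, and each crossing is reported at the index of the first sample after the sign change.

```python
def infinite_peak_clip(samples):
    """Reduce a waveform to a rectangular wave of +1 / -1 values.

    Assumption: samples exactly equal to zero are treated as positive.
    """
    return [1 if x >= 0 else -1 for x in samples]

def zero_crossings(samples):
    """List (sample_index, direction) for every sign change.

    sample_index is the index of the first sample after the crossing;
    direction is "up" or "down".
    """
    clipped = infinite_peak_clip(samples)
    crossings = []
    for i in range(1, len(clipped)):
        if clipped[i] != clipped[i - 1]:
            crossings.append((i, "up" if clipped[i] > 0 else "down"))
    return crossings

waveform = [-2.0, -1.0, 3.0, 5.0, -4.0, 2.0]
louder = [10 * x for x in waveform]   # same shape, ten times the amplitude
```

Here `zero_crossings(waveform)` and `zero_crossings(louder)` return the same list, which shows that only the time and direction of the crossings survive the transformation while all amplitude information is discarded.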

The idea of looking at zero-crossing measures per se is not in itself conceptually new;
following Licklider's studies, other investigators [C2,C3,C4,S1] have looked at zero-crossings and
up-crossings. However, in contrast to most of these other investigators [I1,M3,R1] who have used
zero-crossings to analyze speech, we do not average the up-crossings over a fixed interval of time.
Reasons for this will be discussed shortly. First of all, it is important to be aware that the chief
motivation for most up-crossing studies has been in searching for an inexpensive way to find
frequency-domain acoustic features such as formants [D1,M4,P1]; the zero-crossing methods
avoided the computations required for Fourier transforms, for example. In order to decrease the
expense and variability in examining individual cycles (and in fact no previous claim had been
made attaching any significance to individual cycles), it was easy to compute an average cycle
length by simply counting the number of zero-crossings occurring during a given time interval.
This procedure has two major consequences: 1) the perfect time resolution inherent in the
time-domain is lost when the crossings are averaged; that is, a bandwidth limitation is introduced,
2) the conventional acoustic features extracted are usually less precise and more variable than the
same features extracted directly in the frequency-domain. More recently, the techniques of linear
prediction have been popularized [A1,M1,M2]. Linear prediction consists of finding the
coefficients for a linear filter which minimizes the least squared prediction error, averaged over a given
analysis interval. The non-stationary formulation of linear prediction is a time-domain technique
[C1]. However, the analysis interval or window used (though not inherently determined by the
technique itself) generally consists of a 10 msec duration or a pitch period. Therefore, here also.

Our reason for not averaging up-crossings generally is that in the speech waveform itself there
are significant acoustic features which last for only one or a few cycles in duration. If such cycles
are averaged in with others, this information is irrevocably lost. As previously mentioned, such
transient events frequently occur at phone boundaries as well as between other acoustically distinct
regions, within stop consonants, for example. We do consider it appropriate, however, to average
our time-domain parameters across regions which are acoustically uniform, such as the frication
region of a fricative, exclusive of any boundary transition events. Therefore, when we refer to an
"average" parameter value, only the parameters of cycles within such a uniform acoustic region are
averaged together.
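
The loss caused by fixed-interval averaging can be made concrete with a small Python sketch. Everything in it is hypothetical and chosen for illustration: the up-crossing times (in integer microseconds), the 10 ms window, and the function names. It compares the per-cycle cycle-frequency with the conventional estimate obtained by counting up-crossings per window.

```python
def per_cycle_frequencies(upcrossings_us):
    """Cycle-frequency in Hz for each cycle; times are in microseconds."""
    return [1e6 / (b - a) for a, b in zip(upcrossings_us, upcrossings_us[1:])]

def windowed_frequencies(upcrossings_us, window_us):
    """Average frequency per fixed window: up-crossing count / window length."""
    n_windows = upcrossings_us[-1] // window_us   # only complete windows
    counts = [0] * n_windows
    for t in upcrossings_us:
        w = t // window_us
        if w < n_windows:
            counts[w] += 1
    return [c * 1e6 / window_us for c in counts]

# Hypothetical up-crossing times: ten 2 ms cycles, a two-cycle 0.25 ms
# transient, then ten more 2 ms cycles.
times = [0]
for period in [2000] * 10 + [250] * 2 + [2000] * 10:
    times.append(times[-1] + period)

cycle_cf = per_cycle_frequencies(times)
window_cf = windowed_frequencies(times, window_us=10_000)
```

In this example the two transient cycles show up individually at 4000 Hz in `cycle_cf`, whereas the window containing them reports only 700 Hz in `window_cf`, barely above the 500 Hz of its neighbours.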

The second motivation for this work comes from neurophysiological research on the auditory
information processing of the ear itself [D1,F1,C1,K2,K3,K4]. The information conveyed in an
incoming signal is encoded by different kinds of neurons in the ear and then transformed and
integrated at higher neuronal levels. Each of the neurons communicates information in a very
simple form; namely, by propagating a sequence of electrical impulses or spikes, each of which is
the same voltage or amplitude. This one dimensional response is referred to as "all-or-none".
Currently, the general consensus of auditory neurophysiologists recognizes that the ear codes
different aspects of an auditory signal both spectrally and temporally; that is, in both the
frequency-domain and the time-domain, respectively. The frequency-domain analysis performed
by the ear is analogous to that performed by a filter bank. Different neurons along the basilar
membrane respond to different frequency ranges; that is, a neuron fires if it detects a signal of
sufficient intensity within a given frequency range. Neurons also code information in the
time-domain, in a fashion known as "phase-locking" [K1,R2]. Given a representation of the waveform,
a phase-locking neuron responds by firing once, phase consistently, for each cycle or integer
multiple number of cycles within the waveform. Phase-locking occurs for signal frequencies from
the lowest audible frequencies up to at least 4.5 kHz, and perhaps above 7 kHz. This frequency
range encompasses the chief information bearing portion of the speech spectrum. It is very likely
that the ear has evolved such that both the time-domain and frequency-domain information
derived from acoustic signals is integrated and utilized synergistically.

The analogy between infinite peak-clipping and phase-locking should now become clear. Both
are very similar time-domain transformations of the acoustic signal except that the former method
temporally characterizes two phase consistent aspects of the waveform whereas the latter
characterizes only one. We have previously noted that the kinds of information obtained from
up-crossings and down-crossings for any given signal are highly redundant. The time-domain
techniques we have developed directly examine the information available from infinite
peak-clipping and phase-locking. In addition, we have examined the usefulness of certain other
time-domain parameters derived on a cycle-by-cycle basis.

The time-domain parameters which we have found to be useful are described for a cycle where
$t_1$ is its initial up-crossing, $t_2$ is its down-crossing and $t_3$ is its end or the time of the next
waveform up-crossing. The data is sampled at 20 kHz. Therefore the accuracy of the individual sample
is 50 microseconds. In order to determine that a waveform up-crossing has occurred, the cycle
must have already achieved a negative amplitude value less than minus epsilon (where epsilon
generally has been chosen to be equal to 3, out of a range 0-255) and where the most recent
sample has a positive value greater than positive epsilon. Then a linear interpolation is performed
between this sample and the last negative sample in order to more accurately ascertain the true
time of up-crossing. The period (P) of a cycle equals


(1)  $P = t_3 - t_1$


and the cycle-frequency (CF) equals


(2)  $CF = 1 / P$


Peak  ampliliid.  lor  a cycle  is  simply  the  maximum  positive  ampliinde  observed.  Amax.  during 
dial  cycle.  .Similarly,  valley  amplitude  for  a cycle  is  ihe  most  negalive  ampliinde  obseived.  Amin, 
during  lhal  cycle.  Absohile  ampliinde  (Absamp)  equals 


(3)  Absamp  = Aniax  - Amin, 
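
The up-crossing rule and equations (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's program: the function names, the dictionary layout, and the assumption that the samples are signed values centered on zero are ours. The sampling rate (20 kHz) and epsilon (3) are the values given in the text.

```python
import math

SAMPLE_RATE_HZ = 20_000  # sampling rate given in the text
EPSILON = 3              # threshold value given in the text


def find_upcrossings(samples, eps=EPSILON):
    """Return up-crossing positions as fractional sample indices."""
    crossings = []
    armed = False  # True once the waveform has dipped below -eps
    for i, s in enumerate(samples):
        if s < -eps:
            armed = True
        elif armed and s > eps:
            # Walk back to the last negative sample before this one.
            j = i - 1
            while samples[j] >= 0:
                j -= 1
            # Linear interpolation of the zero crossing between j and i.
            frac = -samples[j] / (s - samples[j])
            crossings.append(j + frac * (i - j))
            armed = False
    return crossings


def cycle_parameters(samples, rate=SAMPLE_RATE_HZ, eps=EPSILON):
    """Return one dict per cycle with P, CF, Amax, Amin and Absamp."""
    crossings = find_upcrossings(samples, eps)
    cycles = []
    for start, end in zip(crossings, crossings[1:]):
        period = (end - start) / rate  # equation (1), in seconds
        body = samples[math.ceil(start):math.ceil(end)]
        a_max, a_min = max(body), min(body)
        cycles.append({
            "t1": start / rate,
            "P": period,
            "CF": 1.0 / period,        # equation (2), in Hz
            "Amax": a_max,
            "Amin": a_min,
            "Absamp": a_max - a_min,   # equation (3)
        })
    return cycles
```

For a steady 200 Hz sine wave sampled at 20 kHz, every cycle returned by cycle_parameters has a CF of about 200 Hz and an Absamp of about twice the sine amplitude.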


The parameters total variation and microstructure are both measures of cycle "smoothness".
These indicate the presence of higher frequency components of sufficient amplitude to "ride" on a
lower frequency "carrier" cycle without causing up-crossings of their own (except near the
zero-crossings of the carrier cycle). The total variation (TV) equals:

(4)  TV = \sum_{t=1}^{n} |a(t) - a(t-1)| / \sum_{t=1}^{n} |a(t)|,  where n = number of samples in that cycle.

The microstructure (MS) equals

(5)  MS = ( \sum_{t=1}^{n} |a(t) - a(t-1)| - 2*Absamp ) / \sum_{t=1}^{n} |a(t)|

where it should be noted that 2*Absamp is the un-normalized TV of a sine, triangle or square
wave

(un-normalized TV = \sum_{t=1}^{n} |a(t) - a(t-1)|).

For example in the time-domain, these latter two features provide a major distinguishing
feature between the high front vowel /i/ and the high back vowel /u/. The phone /i/ is generally
characterized by much higher values of TV and MS than is /u/.
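
As a hedged illustration of the total variation and microstructure measures, the following Python sketch computes both for a single cycle. The function names and the convention of passing the sample that precedes the cycle as a separate argument (prev, standing for a(0)) are assumptions made for this example.

```python
def unnormalized_tv(prev, cycle):
    """Sum of |a(t) - a(t-1)| over the cycle; prev is the sample before it."""
    total = 0.0
    last = prev
    for a in cycle:
        total += abs(a - last)
        last = a
    return total


def _amplitude_sum(cycle):
    """Sum of |a(t)| over the cycle, used to normalize TV and MS."""
    denom = sum(abs(a) for a in cycle)
    if denom == 0:
        raise ValueError("cycle has zero summed amplitude")
    return denom


def total_variation(prev, cycle):
    """Normalized total variation (TV) of one cycle."""
    return unnormalized_tv(prev, cycle) / _amplitude_sum(cycle)


def microstructure(prev, cycle):
    """Microstructure (MS): TV in excess of a smooth cycle's 2*Absamp."""
    absamp = max(cycle) - min(cycle)
    return (unnormalized_tv(prev, cycle) - 2 * absamp) / _amplitude_sum(cycle)
```

A smooth triangular cycle such as 1, 2, 1, 0, -1, -2, -1, 0 (preceded by 0) gives MS = 0, whereas a cycle of the same general shape with a small ripple riding on it gives a positive MS, which is the behavior the text describes.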

We have developed a display which visually captures the information contained in either the
up-crossings or down-crossings of infinitely peak-clipped speech as well as the information
conveyed by a neuron phase locked to the speech waveform. We call this display a "Log Inverse
Period" or "LIP" plot. Originally this display was suggested by Lettvin who has designed an
analog circuit, the CLOOGE (Continuing Log Of On-Going Events), which detects both waveform
up-crossings and cycle amplitude, and then displays these, in real time, on an oscilloscope
screen which may be continuously photographed. This display is derived from the "instantaneous
frequency" plots of single unit activity, used by neurophysiologists.

The initial investigations of these time-domain procedures, using this real-time analog
hardware, were conducted by the author in collaboration with Lettvin in his laboratory at M.I.T.
However, the LIP plots and most of the time-domain information presented here are the results of
speech data digitally sampled at 20 kHz and processed on a PDP-10 in the Computer Science
Department at Carnegie-Mellon University. At the time of recording, the signal was band pass
filtered between about 100 Hz (for attenuation of 60 Hz electrical hum) and about 8 kHz (to
prevent aliasing above the Nyquist frequency of 10 kHz, given a sampling frequency of 20 kHz).

1: The author was a research affiliate (1969-72) of the Research Laboratory of Electronics,
M.I.T., and worked with J. Y. Lettvin (professor in the departments of electrical engineering and
biology).

We generate our visual display as follows: a zero-axis is drawn horizontally through the
acoustic waveform. We note the exact time when the waveform crosses this axis in an upward
direction. Only those up-crossings are recorded which, following a negative excursion of the
waveform, then exceed an amplitude threshold, epsilon, set slightly above the zero-axis. This
threshold tends to preclude very low amplitude background noise. We measure each interval
between successive up-crossings and plot these as a function of time in our displays. Therefore
each up-crossing in the acoustic waveform is represented by a discrete dot. We plot, on a
logarithmic scale, the inverse of the interval between successive up-crossings, that is, the reciprocal
of the cycle period, along the y-axis, and time along the x-axis. This yields a display which
superficially resembles a kind of spectrographic display. We also display a rough intensity measure
by means of a z-axis modulation. The size of a dot representing a given cycle is proportionate to
the log of the greatest amplitude observed during that cycle. This dot size intensity measure of our
LIP displays is analogous to the intensity measure expressed in spectrograms. The following
illustration (Fig. 1) shows the relationship of the log inverse period plot to the waveform from
which it is generated. Note that individual cycle-frequency values may be easily read from the
y-axis. In the LIP plots which are used to illustrate the following chapters, the time-scale has been
compressed more than in this graphical illustration of the LIP plot. Therefore, even though dots
may overlap or appear to occur simultaneously, they actually are occurring at discrete times.
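
The mapping from cycles to LIP dots described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the tuple layout, the choice to place each dot at the start of its cycle, the base-10 logarithm, and the size_scale parameter are all hypothetical; the text specifies only that the y value is the log of the inverse period and that dot size is proportionate to the log of the peak amplitude.

```python
import math


def lip_points(cycles, size_scale=1.0):
    """Map (t_start, t_end, peak_amplitude) cycles to (x, y, size) dots.

    Times are in seconds. The x position is taken to be the cycle start
    (an assumption); y is log10 of the inverse period; size is
    proportional to log10 of the peak amplitude.
    """
    points = []
    for t_start, t_end, peak in cycles:
        period = t_end - t_start
        y = math.log10(1.0 / period)
        # Clamp the peak at 1 so that the logarithm is never negative.
        size = size_scale * math.log10(max(peak, 1.0))
        points.append((t_start, y, size))
    return points
```

A cycle lasting 5 msec therefore lands at log10(200) on the y-axis, and a 1 msec cycle at log10(1000), so a tenfold change in cycle-frequency always spans the same vertical distance.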


Figure 1: The Log Inverse Period Plot


Chapter II - CHARACTERISTICS OF SPEECH IN THE TIME-DOMAIN


INTRODUCTION 

The greatest potential value for time-domain analysis of speech rests with how well it can
detect and characterize acoustic transients. As previously described, these are principally found at
certain phone boundaries as well as at other acoustically distinct regions, such as within stop
consonants, for example. For this reason, we have chosen to concentrate on the characteristics of
fricatives and stop consonants.

This is a large undertaking not only because the time-domain analyses we are using are
radically different from the classical frequency-domain techniques, but also because we are
searching through a much larger data base than is customarily used in order 1) to identify reliably
relevant elemental time-domain characteristics of speech, and 2) to ascertain the significance of
these. The total amount of data examined in detail during the course of these studies consisted of
many hundreds of utterances derived from large sets of nonsense syllables, citation-form phrases,
and complete sentences of connected speech, spoken by more than 20 male and female speakers,
including adults and children, often in noisy environments. The subset of this data which has been
studied most thoroughly consists of 684 phrases in citation form, generously provided by J. Shoup.
Each of the speakers (2 males, 1 female) spoke 228 utterances chosen to provide examples of most
of the allophones of the fricatives and stop consonants, including those common in general English
as described by Shoup [S3]. Further descriptions of this set of data will be provided in Chapter IV
which discusses allophonic differences in the fricatives and stop consonants. In searching for
meaningful aspects of the acoustic signal encoded by time-domain parameters, we have been
guided by an understanding of the primary principles of single-unit stimulus-response
characteristics in the nervous system. Operationally this has meant careful study of acoustic
regions where sharp discontinuities consistently occur along one or more dimensions.

The present chapter discusses the more ubiquitous acoustic phenomena revealed by our
time-domain analyses. First is a discussion of the canonical forms of fricatives and stop
consonants, next a description of some related acoustic-phonological phenomena, and finally some
observations on time-domain characteristics of the other phones. Both LIP plots and waveforms
are provided to illustrate each of the phenomena discussed. It is highly recommended that the
reader spend some time carefully examining these in order to understand the relationships of
information representation in each of these displays. The reader is advised to remember that the
LIP plot is a visual representation of precise cycle-frequency information along with some
amplitude information. However, in the description of the acoustic features of the phones,
characteristic relative changes in cycle-frequency, amplitude, and microstructure measures will be
stated as well.

Each line of waveform has a duration of .1 sec. These lines, read left-to-right from
top-to-bottom, are consecutive. The beginning and end times of each complete waveform are
designated by arrows on the time-axis of the corresponding LIP plot. On some LIP plots, vertical
lines are drawn as segmentation boundary markers between different acoustic states. In these plots,
phone labels for the different segments are printed along the time axis. Since there exists some
background noise for most of the speech shown here, waveform up-crossings normally occur during
intervals of "silence". The LIP dots representing this noise are generally small, reflecting the low
amplitude of such cycles. During intervals of speech, this background noise is superseded by the
greater amplitude of the speech signal. In order to give the reader a more concrete concept of the
parameters measured, single examples of characteristically typical phones will be described with
some quantitative details. In addition the frequency of occurrence for certain time-domain
observed features of speech will be given at the conclusion of this chapter.

FRICATIVES 

The set of fricatives studied consists of /v/, /f/, /ð/, /θ/, /z/, /s/, /ʒ/, and /ʃ/. Generally
the fricatives are acoustically characterized by sustained high frequency regions. In voiced
fricatives, this high frequency region is preceded by a low frequency region, the familiar voice-bar,
which may persist throughout the high frequency region as well. Time-domain analysis reveals that
at the beginning of the high frequency region of the fricative, there are very sharp discontinuities
occurring simultaneously, upward for both cycle-frequency and microstructure, and often a sharp
decrease in amplitude where the fricative is preceded by a vowel. The new acoustic state which
results from these large changes is usually sustained for most of the fricative duration. Usually at
the end of the fricative, sharp discontinuities with respect to cycle-frequency, amplitude, and
microstructure are again observed, at the boundary between the fricative and the following phone.
However, a different transient kind of acoustic event often occurs at the very beginning and again
at the very end of the fricative. Sometimes occurring at these places is one or a few cycles
characterized by lower cycle-frequencies than those of the other cycles in the acoustic segment
immediately preceding and in the acoustic segment immediately following this transitional
phenomenon. Amplitude of these cycles is variable, although the cycle microstructure is usually
low. These transition cycles are marked "t" in the LIP plots. Regions of frication are marked "f",
and for voiced fricatives, the initial voicing region is marked "v". The first example (Fig. 2) is an
/s/ from the utterance "there sir" (HN,♀). The duration of the frication region is .17 sec. The
cycle-frequency of the transition cycle into the frication is 201 Hz and the cycle-frequency of the
transition cycle at the end of the frication is 304 Hz. Within a few cycles of the initial transition
cycle, the average amplitude of the cycles in the fricated region drops to a fraction of its value
prior to the transition, while the cycle-frequency and microstructure measures sharply increase.
The same kinds of changes occur, although in the opposite direction, as the frication abruptly ends
at the final transition cycle, and the following vowel commences.

The second example (Fig. 3) shows the voiced fricative /v/ in the utterance "invent" (EH,♂).
In this utterance, there occurs first a transition cycle with a cycle-frequency of 154 Hz followed by
four cycles of voicing (average cycle-frequency for this acoustic region is 206 Hz), which precede
the /v/ frication. This frication lasts for 42.3 msec and is terminated by two end transition cycles
with cycle-frequencies of 375 Hz and 361 Hz, respectively. Here the initial sharp discontinuities,
upward for cycle-frequency and microstructure, and downward for amplitude, commence not with
the initial voicing portion of the /v/ but with the onset of frication. This is typical of a voiced
fricative. And here too, sharp discontinuities in the opposite direction occur within a few cycles of
the final transition cycle.

Example representations follow for each of the remaining fricatives. The phones and
utterances are /ð/ (Fig. 4) in "Display the phonemic" (LM,♀), /θ/ (Fig. 5) in "to thaw" (EH,♂),
/z/ (Fig. 6) in "hers earns" (HN,♀), /f/ (Fig. 7) in "the feed" (JA,♂), /ʒ/ (Fig. 8) in "rouge was"
(HN,♀), and /ʃ/ (Fig. 9) in "her shirt" (EH,♂).


Figure 6: hers earns (♀,HN)

Figure 7: the feed (♂,JA)

(LIP plots; axes: Log Cycle Frequency (Hz) versus Time (sec))


STOP CONSONANTS

The stop consonants are the set /p/, /t/, /k/, /b/, /d/, and /g/. Acoustically, stop
consonants typically have a pause portion followed by a higher frequency region which represents
the stop consonant release portion, plus aspiration, if present. A voiced stop consonant has a low
frequency voicing region just preceding the pause portion. Often these lower voicing frequencies
are sustained throughout part or all of the release-aspiration region as well. It is not uncommon
for the pause segment to be completely omitted in a voiced stop consonant.

As the waveform transitions from prior context, or the initial voicing region in voiced stop
consonants, into the pause portion, the cycle-frequency, amplitude, and microstructure measures
sharply decline. This pause portion consists of only one or a few cycles, and the cycle-frequencies
are quite low, usually less than 100 Hz. This abrupt drop in cycle-frequency is visually quite
apparent in the LIP plots. Since the pause cycles are often of very low amplitude, and therefore
are represented by very small dots on the LIP plots, we have visually enhanced them by automatic
replacement of them with an asterisk symbol. Next, as the waveform transitions abruptly into
the release-aspiration region, both cycle-frequency and microstructure increase sharply as does
amplitude, which nonetheless at its peak value generally remains well below the amplitude value
for stressed vowels and most unstressed vowels. Where aspiration is clearly present, the transition
from release to aspiration is usually quite smooth, though often with the cycle frequency and
amplitude values gradually decreasing.
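
The asterisk substitution for pause cycles might be sketched as below. The text does not state the rule actually used to pick out pause cycles for replacement; treating "usually less than 100 Hz" as a hard cutoff is an assumption of this sketch, as are the function name and the "." symbol standing for an ordinary dot.

```python
PAUSE_CF_LIMIT_HZ = 100.0  # assumed hard cutoff, from "usually less than 100 Hz"


def lip_symbols(cycle_freqs_hz, limit_hz=PAUSE_CF_LIMIT_HZ):
    """Return one plot symbol per cycle: '*' for pause cycles, '.' otherwise."""
    return ["*" if cf < limit_hz else "." for cf in cycle_freqs_hz]
```

For instance, a sequence of cycle-frequencies such as 210, 205, 82, 4100, 3900 Hz (illustrative values) would have only its third cycle drawn as an asterisk.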

In the LIP plots shown here, pause cycles are marked "p", the release-aspiration region by
"r", and the initial voiced region of voiced stop consonants by "v". The following example is of
the /t/ (Fig. 10) in the utterance "the till" (EH,♂). Here the pause cycle has a cycle-frequency of
82 Hz. The release-aspiration portion has an average cycle-frequency value exceeding 4000 Hz
and a duration of 68.3 msec, which is longer than is usually found in connected speech.


Figure 10: the till (♂,EH)


Time-domain analysis also reveals the existence of several more subtle acoustic phenomena.
These phenomena are often both short in duration and low in amplitude. They occur at phone
boundaries and last for only one or a few cycles in the acoustic waveform.

The first of these is analogous to the transitional cycles previously described for fricatives. At
the end of the release-aspiration region of the stop consonant, there is often, though not always,
one or a few cycles which have lower cycle-frequencies than any of the other cycles in either of the
acoustic segments immediately preceding and following this acoustic event. These transitional
segments are marked as "t" in the LIP plots which follow.

The second phenomenon shall be referred to as a "stop preview". In the case of a stop
consonant which is preceded by a vowel (and sometimes by other phone types as well), the very
end of the vowel is sometimes characterized by one or two cycles which have much higher
cycle-frequency values than any of the other cycles which comprise the vowel. These stop
previews are usually very low in amplitude. Their duration is almost always less than 1 msec and
commonly less than .6 msec. In the LIP plots, these are marked as "sp". The physical cause for
these stop previews is not known. It is possible that they constitute a stop "closure" phenomenon
resulting from high frequency turbulence as the articulators close in preparation for the stop
consonant.
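
A stop preview, as characterized above, could be searched for at the end of a vowel along the following lines. This is a hypothetical sketch and not a procedure from the text: the function name, the ratio of 2.0 used to decide that a cycle-frequency is "much higher", and the decision to test the last two cycles before the last one are all assumptions. Only the limits of one or two cycles and a duration under 1 msec come from the description.

```python
def find_stop_preview(cycle_freqs_hz, ratio=2.0, max_duration_s=0.001, max_cycles=2):
    """Return indices of trailing vowel cycles that look like a stop preview.

    cycle_freqs_hz lists the cycle-frequencies (Hz, all positive) of the
    cycles of one vowel, in order. An empty list means no preview found.
    """
    n = len(cycle_freqs_hz)
    for k in range(max_cycles, 0, -1):  # try two trailing cycles, then one
        if n <= k:
            continue
        body = cycle_freqs_hz[:-k]
        tail = cycle_freqs_hz[-k:]
        duration = sum(1.0 / f for f in tail)  # each period is 1/CF
        if min(tail) > ratio * max(body) and duration < max_duration_s:
            return list(range(n - k, n))
    return []
```

With illustrative vowel cycle-frequencies near 300 Hz followed by two cycles above 2500 Hz, the function returns the indices of those last two cycles, whose combined duration is under 1 msec.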

The third phenomenon concerns the one or two cycles immediately preceding the stop
preview. These one or two cycles are usually of large amplitude, but have a lower cycle-frequency
value than any of the cycles immediately preceding. Only at the very beginning of the vowel are
there other cycles with cycle-frequencies as low or lower than the cycles immediately preceding the
stop preview. These "stop preview transitional cycles" are sometimes omitted when the stop
preview is present. They are marked "spt" in the LIP plots.

Illustrative examples of all these phenomena are provided in the utterance "to do" (Fig. 11),
(HN,♀). Here the cycle-frequency of the stop preview transition is 296 Hz and the two cycles
comprising the stop preview have cycle-frequencies of 2535 Hz and 2695 Hz, respectively. The
duration of the stop preview is .77 msec. Immediately following is the voicing region with an
average cycle-frequency of less than 200 Hz, and a duration of 70.3 msec. The release-aspiration
portion of the /d/ has an average cycle-frequency exceeding 1500 Hz and a duration of only 16.8
msec. Two transition cycles mark the boundary between the /d/ release and the following vowel.
Their cycle-frequencies are, respectively, 325 Hz and 151 Hz. All of these abrupt changes in
cycle-frequency, amplitude, and microstructure resolve the stop consonants into several
unambiguously distinct acoustic regions.


Figure 11: to do (♀,HN)


The following LIP plots provide representations of each of the other stop consonants. In
these the different acoustics are individually marked but are analogous to those previously
designated. These LIP plots show the /p/ (Fig. 12) in "display the" (LM,♀), the /b/ (Fig. 13) in
"the back lefthand" (KC,♂), the /k/ (Fig. 14) in "has whitlockite" (KK), and the /g/ (Fig. 15)
in "he grows" (HN,♀).


Figure 13: the back lefthand (♂,KC)

Figure 15: he grows (♀,HN)

(LIP plots; axes: Log Cycle Frequency (Hz) versus Time (sec))


ACOUSTIC PHONOLOGICAL PHENOMENA


There are a variety of acoustic phonological phenomena which are commonly observed with
time-domain analysis. Generally these phenomena are readily apparent in both the waveform and
LIP plot. However, especially when such acoustic events are either very brief in duration or low in
amplitude, or both, their existence is often much more visually evident in the LIP plots.

One very common phenomenon is the case where a fricative is characterized by a central
region where the cycle-frequencies are lowered in relation to that phone's characteristic frequency.
This central region may result from articulatory changes which occur in preparation for articulation
of the following phone. In the following example, the phone of interest is a rounded /f/ (Fig. 16)
in the utterance "no foe" (HN,♀). Here the average cycle-frequency value for the initial fricated
region is 1079 Hz, for the central region is 893 Hz, and for the final fricated region is 1148 Hz. In
addition, the first fricated region is much greater in amplitude than the central and final fricated
regions, which are about equal in amplitude. Two more examples follow with the /s/ (Fig. 17) in
"pyroxine" (HH) and the second /f/ (Fig. 18) in "from left to" (KH).


Figure 16: no foe (♀,HN)

Figure 17: pyroxine

(LIP plots; axes: Log Cycle Frequency (Hz) versus Time (sec))


The next phenomenon concerns the issue of the acoustic correlates of what are commonly
referred to as "unreleased" stop consonants. Time-domain analysis reveals that many of the stop
consonants which are phonetically transcribed by linguists as "unreleased" or "deleted" are better
described as "minimally released". These are acoustically characterized by the usual pause
cycle(s), followed by a very brief segment of high cycle-frequency energy which is analogous to a
normal release-aspiration segment except for its short duration, and which is sometimes followed
by the transition cycle(s) leading into the next phone. This very brief release-aspiration segment
consists of only a few cycles, often just one or two cycles, where the entire duration of this portion
ranges from less than 1 msec to more than 6 msec. The temporal sequence of acoustic events
characterizing these minimally released stop consonants is essentially identical with that for
normally released stop consonants, except for durational aspects. The few cycles with high
cycle-frequencies remaining in minimally released stop consonants are generally insufficient for
reliable identification of the stop consonant. However, the information that a stop consonant has
occurred, and whether or not it was voiced, does remain in most cases. The following example
shows such a minimally released stop consonant, the /d/ (Fig. 19) in "would give" (HN,♀). Here
the release-aspiration segment is comprised of only 2 cycles, with a total duration of 1.4 msec, and
is preceded by a normal voicing region.


The following LIP plots provide several more examples of this same phenomenon, very
commonly found in connected speech, especially with the first of two successive stop consonants.
Acoustic observations of a very brief stop consonant often indicate that another stop consonant
will immediately follow. These examples show the /b/ (Fig. 20) in "tub took" (EH,♂), the /k/
(Fig. 21) in "spectrogram" (LM,♀), and the /p/ (Fig. 22) in "stoop to" (EH,♂).


Figure 19: would give (♀,HN)

Figure 20: tub took (♂,EH)

Figure 21: spectrogram (♀,LM)

Figure 22: stoop to (♂,EH)

(LIP plots; axes: Log Cycle Frequency (Hz) versus Time (sec))


Another phonological phenomenon relates to the occasional insertion of an extra stop
consonant. This occurs when a syllable ends with a stop consonant and the next syllable begins
with a vowel (or sometimes even a liquid or nasal), even when a substantial interword pause
boundary separates the two syllables. The speaker will often articulate a normally released stop
consonant at the end of the first syllable as expected, but then repeats this same stop consonant in
a briefer less intense form, when he begins the next syllable. We refer to this consonant doubling
phenomenon as "gemination". When this happens, it is not perceptually obvious to a listener that
a second stop consonant has been inserted by the speaker. Although this phenomenon occurs
frequently, it is not so obvious from visual displays if the inter-syllable or inter-word pause is less
than about 30 msec. In the following utterance (Fig. 23) "about Israel" (JB,♂), the /t/ in "about"
is repeated, even after a long interword pause of .18 sec, just before the initial vowel in the word
"Israel". In this example, both acoustic manifestations of /t/ are very similar. As compared with
the first /t/, the second or inserted /t/ has a duration of 38% and an average amplitude of 82%.
As can be seen from the LIP plot, the component cycle-frequencies of both are very similar,
although slightly higher for the second /t/, which is followed by a high front vowel. Generally,
where the preceding and following contexts for such a stop consonant are very different, the
acoustic manifestations of that consonant may also differ substantially due to differing effects of
coarticulation. Specific coarticulation effects on the acoustic characteristics of stop consonants
will be discussed in Chapter IV.


Figure 23: about Israel (♂,JB)


This phenomenon of the inserted stop consonants should not be confused with a normal
glottal stop which commonly precedes an utterance initial vowel and differs substantially from a
stop consonant. In contrast to the stop consonants, the release of a normal glottal stop is not
composed of a well-defined and sustained high cycle-frequency region, but is characterized by a
diffuse region of mixed cycle-frequency components. Shoup [S3] has suggested, however, that
this acoustic occurrence of a second stop consonant may be produced by releasing a glottal closure
while the articulators remain at or near the same position assumed during the preceding /t/
release. In order to determine the physical position of the articulators for this and other speech
phenomena, major studies must be performed which observe the articulator positions and
movements with a precise synchronization of the speech waveform.

Several more examples are shown in the following LIP plots in the utterances "berthed all"
(Fig. 24, EH), "lug more" (Fig. 25, HN, ♀), and "phonemic labels" (Fig. 26, LM), where /d/,
/g/, and /k/, respectively, are acoustically each manifested twice.


Figures 24-26: LIP plots (axes: Log Cycle Frequency (Hz) versus Time (sec))


The Other Phones

Although our investigations have centered on the acoustic characteristics of fricatives and stop
consonants, we note here general observations on some of the other phone classes. The
characteristics cited occur commonly throughout our data sets.

VOWELS

The vowels, especially the stressed vowels, are very large in amplitude. Generally, cycle-
frequency and microstructure measures are somewhat variable throughout any given vowel,
although they tend to be higher for front vowels, as would be expected. The LIP plots of vowels
sometimes appear as very regular patterns, but often appear highly irregular. The former case is
demonstrated in the utterance "to sue" (Fig. 27, JA, ♂). Here the regular double-line pattern
appears superficially to represent formant structures, although, of course, that is not the case with
this time-domain technique. In these two vowels, each pitch period contains two major cycles of
different frequencies, which are relatively consistent from one pitch period to the next, hence the
double line.

Vowels often appear much more irregular in LIP plots. This especially occurs for high front
and mid vowels where relatively large amplitude microstructure rides on the lower cycle-frequency
components. This means that the individual periods of the major cycles within a pitch period are
often increased or decreased by one or more of the higher cycle-frequency components riding on
them. This kind of irregular pattern is exemplified in the LIP plot of "the gift" (Fig. 28, JA, ♂).


Figure 27: to sue (♂, JA)

Figure 28: the gift (♂, JA)


Sometimes, especially for vowels of long duration, a transient event occurs during the last half
or third of the vowel. This event consists of one or two unusually high cycle-frequency cycles
which seem to occur just prior to an increase of irregularity in the vowel pattern as seen in the LIP
plot. At this break in the vowel pattern, amplitude often decreases quite rapidly. General
degradation of the vowel may be occurring during this end portion. If this is the case, then vowel
identification would presumably be more reliable when based on acoustic observations of the
vowel region prior to such a break. This phenomenon is illustrated in the vowel pattern in both
vowels in the utterance "the key" (Fig. 29, HN, ♀). The high cycle-frequency dots occurring at this
vowel break are marked in the plot.




LIQUIDS 

The acoustic characteristics of liquids have not been thoroughly studied in the time-domain,
although we know that, taken as a class, liquids are quite variable with respect to cycle-frequency,
amplitude, and microstructure. Of all the phone classes, liquids are hardest to segment out of the
speech stream, by hand or automatically. In contrast to most other phones, liquids often start and
end quite gradually with segue regions. There is, of course, no a priori reason for assuming that
sharp boundaries should always exist. Two examples follow which illustrate some boundary
ambiguities. First is the /l/ (Fig. 30) in "week old" (JA, ♂). Next is the /r/ (Fig. 31) in the word
"tridymite" (BB, ♂). A segmentation line has been inserted here where the juncture between the
/t/ and /r/ was best approximated. In contrast, we also find less ambiguous junctures. An
illustrative example appears for the /r/ (Fig. 32) in the utterance "week ran" (JA, ♂).


Figure 30: week old (♂, JA)

Figure 31: tridymite (♂, BB)


NASALS 

Usually the nasals are characterized by large amplitude, low cycle-frequency, and low
microstructure. They are almost always very easy to spot visually and to segment automatically.
They usually appear as a straight, almost horizontal, low cycle-frequency line. Beginning abruptly,
they sustain acoustic consistency throughout their duration, and then usually terminate sharply.

The voicing regions which appear prior to voiced fricatives and stop consonants are acoustically
very similar to nasals, except that voicing regions tend to be lower in amplitude. An example
follows which illustrates both a typical /n/ and a voicing region preceding the /g/ in the utterance
"egg nog" (Fig. 33, HN, ♀).

The problem encountered by some frequency-domain analyses, of distinguishing nasals from
liquids, is much less severe in the time-domain. The low cycle-frequency components of liquids
almost always exceed 300 Hz, whereas nasals tend to have lower cycle-frequencies just slightly
below or above 200 Hz. In addition, liquids usually have higher cycle-frequency components with
amplitudes larger than those of the higher cycle-frequency components sometimes occurring in
nasals. An illustrative example demonstrates the distinction between the nasals /n/ and /m/, and
the liquid /l/ (Fig. 34, LM) in the utterance "phonemic labels".
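
The rule of thumb just described can be sketched in Python. This is only an illustration: the function name `nasal_or_liquid` and the single 300 Hz cutoff are assumptions drawn from the approximate figures quoted above, not thresholds taken from the segmentation program itself.

```python
def nasal_or_liquid(low_cf_hz: float, cutoff_hz: float = 300.0) -> str:
    """Guess 'nasal' or 'liquid' from the low cycle-frequency component (Hz).

    cutoff_hz is an assumed dividing line: the text reports that liquids
    almost always exceed 300 Hz while nasals lie near 200 Hz.
    """
    return "liquid" if low_cf_hz > cutoff_hz else "nasal"
```

A practical classifier would presumably also weigh the amplitude of the higher cycle-frequency components, which the text names as a second cue.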


Figure 33: egg nog (♀, HN)


DISCUSSION 

One of the most striking aspects of speech, as revealed by time-domain analyses, is its discrete
nature. This fact is readily apparent in the waveform and LIP plots. Unfortunately, this quality of
speech is often completely obscured in spectrographic displays, due to the inherent bandwidth
limitation which averages acoustic observations over intervals, generally, of 5 msec, 10 msec, or
more. This creates a visual (and digital) illusion of smooth gradual transitions from one acoustic
state to another. However, precise temporal resolution shows that, with the exception of liquids,
the acoustic properties in the speech waveform change abruptly along one or more dimensions at
phone boundaries. We also know that some phone types are characterized by an explicit temporal
pattern of two or more distinct acoustic states. In many cases, additional redundancy for evidence
of these boundaries is provided by the presence of transition cycles in and out of high cycle-
frequency regions, stop previews, and stop preview transitions. Each of these events is also
accompanied by sharp discontinuities in one or more time-domain parameters. These acoustic
events clearly designate most inter-phone boundaries as well as delineating separate acoustic states
internal to certain phones.
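
The smoothing effect of interval averaging can be illustrated with a small Python sketch. The data are hypothetical (one cycle-frequency value per millisecond, jumping abruptly from 200 Hz to 2000 Hz), and a simple centred moving average stands in for the averaging performed by a spectrographic display; it is not a model of any particular analyser.

```python
def moving_average(values, window):
    """Average each value with its neighbours inside a centred window."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Hypothetical track: an abrupt change at a phone boundary (index 10).
track = [200.0] * 10 + [2000.0] * 10
# A 5-sample window plays the role of a 5 msec averaging interval.
smoothed = moving_average(track, window=5)
```

In `track` the change occurs between two adjacent samples, whereas `smoothed` passes through several intermediate values on either side of the boundary, which is the apparent gradual transition referred to above.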

In the following table, we present statistics on the frequency of occurrence for the stop
consonant and fricative features described in this chapter. In addition, we have found that we can
use these time-domain features, in conjunction with others which characterize the other phone
classes, as the sole basis for automatic segmentation of the speech waveform. In Chapter III, the
performance of our segmentation program is discussed in comparison with other existing
segmentation programs, run on the same data set and generously made available by the speech
community.


Frequency of Occurrence
for
Certain Time-Domain Speech Characteristics


On a set of 28 sentences included in the Lincoln Laboratory data set (described more fully in
Chapter III), we obtained the following statistics for time-domain feature occurrences in stop
consonants and fricatives:


                                            Percentage Occurrence

Stop Consonants (n=135):

    Stop Preview Transition                         15%
    Stop Preview                                    26%
    Voicebar and/or Pause                          100%
    Transition Cycle(s)                             68%
      (at end of Release-Aspiration)

Fricatives (n=ia7):

    Beginning Transition Cycle(s)                   66%
    End Transition Cycle(s)                         60%


Chapter III - SEGMENTATION OF CONTINUOUS SPEECH
INTRODUCTION 

Speech segmentation usually implies the division of the speech waveform into a series of
discrete acoustic states which are directly related to the phone string communicated by that
waveform. This isolation process has become important both for understanding the basic acoustic
characteristics of individual phones and as an essential step for phone identification
in most speech recognition systems. In this chapter we present, first, a new segmentation
philosophy and implementation, and secondly, the comparative results of this program with other
segmentation programs presently available in the speech community.

THE  SEGMENTATION  PROGRAM 

Our segmentation program is based solely on descriptions of the time-domain parameters
which characterize a subset of the speech characteristics discussed in Chapter II. Subject to
changes in these parameters, the segmenter transitions among eight acoustic states. These states
are defined on the basis of our understanding of the temporal sequence of acoustic events
characteristic of different phone classes. For example, based on the time-domain information
about stop consonants and fricatives in Chapter II, each of these phone classes may be represented
as a sequential pattern of acoustic events, some optional and some required.

In Fig. 35, the network describes the acoustic event pattern for both voiced and unvoiced stop
consonants. The only absolutely required nodes in this network are the release-aspiration node
and the preceding voicing and/or silence node(s). In the present segmentation program
implementation, the stop preview and stop preview transition nodes are not represented because we
found they yielded information redundant to that already available. Next, in Fig. 36, is the network
representation for voiced and unvoiced fricatives. For unvoiced fricatives, the only required node
is the frication node itself, and for voiced fricatives, the voicing node must precede this frication
node.
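
One way to picture these event networks is as patterns over a string of event codes. The Python sketch below is an illustration under stated assumptions, not a transcription of Figs. 35 and 36: the single-letter codes, the function name `classify`, and the placement of the optional transition cycles are all hypothetical, and the stop preview nodes are left out, as in the implementation described above.

```python
import re

# Single-letter event codes (assumed for this sketch):
#   P = pause/silence, V = voicing, R = release-aspiration,
#   F = frication, T = transition cycle(s)

# Stop: voicing and/or silence (required), release-aspiration (required),
# then optional transition cycle(s).
STOP = re.compile(r"^(?:VP|PV|V|P)RT?$")

# Fricative: optional voicing (present only for voiced fricatives),
# optional transition cycle(s) before and after the required frication.
FRICATIVE = re.compile(r"^(V?)T?FT?$")

def classify(events: str) -> str:
    """Name the phone class whose event pattern matches, if any."""
    if STOP.match(events):
        return "stop"
    m = FRICATIVE.match(events)
    if m:
        return "voiced fricative" if m.group(1) else "unvoiced fricative"
    return "no match"
```

For example, `classify("PR")` matches the stop pattern, while `classify("R")` matches nothing because the required preceding voicing or silence is absent.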




The eight acoustic states recognized by the program correspond very roughly to phone or
sub-phone classes; namely, 1) silence (A), 2) unvoiced release-aspiration (B), 3) unvoiced
frication (C), 4) nasal (D), 5) transition (E), 6) vowel (F), 7) voiced release-aspiration (G), and
8) voiced frication (H). Each of these states has already been described generally in Chapter II.
Specific quantitative descriptions for each of these states, as well as their transition characteristics,
are incorporated explicitly under headings of the same name in the segmentation program.
Segment boundaries are placed when large changes of certain kinds are observed in the acoustic
parameters. In the present implementation, we chiefly use the parameters of cycle-frequency and
cycle peak amplitude (Amax), along with information about the acoustic state duration in terms of
the number of cycles observed during a given acoustic state, and the identity of the past two
acoustic states observed. In order to test the acoustic parameters of any given cycle relative to its
context, the acoustic parameters of the previous ten cycles and the following ten cycles are
available for comparison. By utilizing the information derived from the temporal phone patterns in
conjunction with that available from cycle context comparisons, we allow for a great deal of
individual cycle variability without triggering excessive numbers of extra or false segment markers.
In addition, however, even very short duration low amplitude events such as fricative and stop
consonant transition cycles are readily recognized because they conform to the known temporal
phone patterns and bear known parameter relationships to their context. Redundancy of this sort
is essential for segmentation decision reliability.
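
The cycle context described above can be sketched as a small data structure in Python. The names `Cycle` and `context` are hypothetical, and amplitude units are left unspecified because the text does not give them; the sketch only shows how the previous ten and following ten cycles might be made available for each test.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Cycle:
    cf: float     # cycle-frequency in Hz
    amax: float   # cycle peak amplitude (units assumed arbitrary)

def context(cycles: List[Cycle], t: int,
            span: int = 10) -> Tuple[List[Cycle], List[Cycle]]:
    """Return (previous, following) cycles around index t, up to `span` each."""
    previous = cycles[max(0, t - span):t]
    following = cycles[t + 1:t + 1 + span]
    return previous, following
```

Near the start or end of an utterance the returned lists are simply shorter than ten cycles; how the original program treated those edges is not stated.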

The program itself is quite simply organized. Each cycle of the waveform is evaluated in turn,
to determine if it belongs to the present acoustic state, or if it marks the beginning of a new
acoustic state. A certain set of state-dependent tests is applied to the parameters of each cycle. In
Part I of the program, only those tests associated with the present acoustic state are applied to the
parameters of a new cycle (each utterance is arbitrarily initialized to the silence state A).

In Fig. 37, all the acoustic state tests are fully described. Here CF_t, A_t, and D_t, respectively,
designate, for the present cycle (present time = t), the parameter values of cycle-frequency, cycle
peak amplitude, and the present acoustic state duration, up to but not including the present cycle.




For example, if the present acoustic state is the silence state A, then only two tests are applied
to a given cycle. These tests are applied in the order indicated. The first says that if the cycle-
frequency of this cycle is greater than 120 Hz, and either the present cycle or the next cycle has a
cycle-frequency greater than 600 Hz, then transition to state B. The second test says that if the
cycle-frequency values of the present cycle and the next cycle both exceed 200 Hz, then transition
to state E. If the conditions of any of these tests are met, then a segment marker is inserted, a new
acoustic state is recognized, and the program then commences to evaluate the next cycle in the
same fashion. Whenever none of the tests is met for transitioning out of the present acoustic state,
and the present acoustic state has already been observed for two or more cycles, then three more
tests are applied to this cycle. These three tests comprise all of Part II of the program. These three
tests are described in Fig. 38.
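
The two silence-state tests just described translate directly into code. The following Python sketch covers state A only; the function name `part1_state_a` is hypothetical, and the handling of the final cycle (where no next cycle exists) is an assumption of this sketch rather than something stated in the text.

```python
from typing import List, Optional

def part1_state_a(cf: List[float], t: int) -> Optional[str]:
    """Apply the two silence-state tests to cycle t; return the new state or None.

    cf holds cycle-frequencies in Hz. At the last cycle there is no next
    cycle, so the present value is reused (an assumption of this sketch).
    """
    cf_t = cf[t]
    cf_next = cf[t + 1] if t + 1 < len(cf) else cf_t
    # Test 1: above 120 Hz, and present or next cycle above 600 Hz -> B
    if cf_t > 120 and max(cf_t, cf_next) > 600:
        return "B"
    # Test 2: present and next cycle both above 200 Hz -> E
    if min(cf_t, cf_next) > 200:
        return "E"
    return None
```

Because the tests are applied in order, a cycle that satisfies both conditions is assigned to state B rather than state E.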

In some cases, these preceding tests invoke the additional tests TT, TP, TN, and TS. The first
of these, TT, tests to see if a return should be made to an acoustic state previously encountered.
Here "oldstate" refers to the acoustic state immediately preceding the present one, and
"oldoldstate" refers to the state just prior to the oldstate. The other tests, TP, TN, and TS, are
general tests for signal periodicity, nasality, and frication, respectively. All of these tests are also
described in Fig. 38. If all the necessary tests in Part I and Part II are applied, and none are met,
this newly examined cycle is assumed to belong to the present acoustic state, and testing then
commences for the next cycle.
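
The overall control flow (Part I tests for the present state, then Part II tests once the state has lasted two or more cycles) can be sketched as a short driver loop. This Python sketch is an assumption-laden outline: `segment`, the `Test` signature, and the way tests are passed in as callables are all hypothetical, and the auxiliary tests TT, TP, TN, and TS are not modelled.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A test takes (cycles, t, state, duration) and returns a new state or None.
Test = Callable[[List[float], int, str, int], Optional[str]]

def segment(cycles: List[float],
            part1: Dict[str, List[Test]],
            part2: List[Test]) -> List[Tuple[int, str]]:
    """Return (cycle index, new state) markers; starts in silence state 'A'."""
    state, duration = "A", 0
    markers: List[Tuple[int, str]] = []
    for t in range(len(cycles)):
        tests = list(part1.get(state, []))
        if duration >= 2:            # Part II only after two or more cycles
            tests += part2
        new_state = None
        for test in tests:           # tests are tried in order; first hit wins
            new_state = test(cycles, t, state, duration)
            if new_state is not None:
                break
        if new_state is not None and new_state != state:
            markers.append((t, new_state))   # insert a segment marker
            state, duration = new_state, 0
        duration += 1                # the present cycle now counts toward the state
    return markers
```

Here `duration` plays the role of the state duration up to but not including the present cycle, so a cycle that opens a new state leaves that state with a duration of one when the next cycle is examined.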




PART I

STATES                                TESTS

A (silence)
    1) if (CF_t > 120) ∧ (max(CF_t, CF_{t+1}) > 600) → B
    2) if min(CF_t, CF_{t+1}) > 200 → E

B (unvoiced release-aspiration)
    1) if CF_t < 120 → A
    2) if (A_t > 100) ∧ (TP) → G
    3) if (CF_t < 200) → E
    4) if max(CF_t, CF_{t+1}) < 600 → E

C (unvoiced frication)
    1) if CF_t < 120 → A
    2) if (A_t > 100) ∧ (TP) → H

D (nasal)
    1) if CF_t < 100 → A
    2) if (CF_t > 650) ∧ (CF_{t+1} > 400) ∧ ((CF_{t+2} > 400) ∨ (CF_{t+3} > 400)) → B
    3) if (CF_t > 350) ∧ (CF_{t+1} > 350) ∧ ((A_{t+2} > 350) ∨ (A_{t+3} > 350)) → [illegible]

E (transition region)
    1) if (CF_{t+1} > 350) ∧ (D_t > 1) → F

F (vowel)
    1) if (CF_t > 1000) ∧ (2 or more of (CF_{t+1}, CF_{t+2}, CF_{t+3}, CF_{t+4}) > 1000) → TT

G (voiced release-aspiration)
    1) if (CF_t < 200) ∧ (CF_t > 120) → E
    2) if (CF_t < 120) → A
    3) if (A_t < 100) ∧ (¬TP) ∧ (max(A_{t+1}, A_{t+2}) < 100) → B

H (voiced frication)
    1) if (CF_t < 200) ∧ (CF_t > 120) → E
    2) if CF_t < 120 → A
    3) if (A_t < 100) ∧ (¬TP) ∧ (max(A_{t+1}, A_{t+2}) < 100) → [illegible]
    4) if (max(CF_t, CF_{t+1}) < 1000) ∧ [illegible]


Figure 37




PART II

For all states with duration > 1:

    1) if (¬A) ∧ (¬D) ∧ (CF_t < 120) → A
    2) if (¬D) ∧ (CF_t < 350) ∧ (CF_t > 100) ∧ (max(A_t, A_{t+1}) > 120) → TN
    3) if (¬A) ∧ (¬B) ∧ (¬C) ∧ (¬G) ∧ (¬H) ∧ (CF_t > 1000) → [illegible]

TESTS

TT (test oldstate)
    1) if (oldstate = E) ∧ (oldoldstate = B, C, G, or H) → oldoldstate
    2) [illegible]

TP (periodicity check)
    This test is true for regions of amplitude and cycle-frequency periodicity.

TN (nasality check)
    If 2 or more of the next 4 cycles have cycle-frequency < 400 → D

TS (frication check)
    Over the following 9 cycles, this test checks to see there are no low frequency cycles
    [threshold illegible].


Figure 38




TESTS

    1) if F + E(dur. < 10 msec) → F
    2) if H(dur. > 20 msec) + C(dur. < 20 msec) + H(dur. > 20 msec) → H
    3) if C(dur. > 20 msec) + H(dur. < 20 msec) + C(dur. > 20 msec) → C
    4) if G(dur. > 20 msec) + B(dur. < 20 msec) + G(dur. > 20 msec) → G
    5) if B(dur. > 20 msec) + G(dur. < 20 msec) + B(dur. > 20 msec) → B


Figure 39




At the end of an utterance, the output of this program consists of a series of times at which
segment markers were inserted, with their respective acoustic state designations. This segmentation
output is then run through an editing program which concatenates certain kinds of segments, and
deletes very brief segments of noise-like regions.

This editing program, described in Fig. 39, only alters certain boundaries of acoustic segments
which are less than 20 msec in duration. If the acoustic state F (of any duration) is followed by
state E, where E has a duration of less than 10 msec, then the segment marker between states F
and E is removed, and the combined region is classified as state F. If an unvoiced frication
segment of less than 20 msec duration is both preceded and followed by voiced frication regions
of duration equal to or greater than 20 msec, then the unvoiced frication segment boundaries are
removed, and the previously defined sequence of the three segments (voiced, unvoiced, and voiced
frication) is classified as a single voiced frication segment. Such merging also occurs if a short
voiced frication segment is surrounded by longer unvoiced frication segments, which then prevail.
In the same way, merging is performed for successive voiced-unvoiced release-aspiration segments
of stop consonants.
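
The editing rules can be sketched in Python as a single pass over a list of segments. This is an illustrative outline only: the names `Segment`, `SANDWICHES`, and `edit` are hypothetical, the long-segment condition follows the prose ("equal to or greater than 20 msec"), and whether the original editor re-examined a merged segment for further merging is not stated, so this sketch does not.

```python
from typing import List, Tuple

Segment = Tuple[str, float]   # (acoustic state, duration in msec)

# (outer state, short inner state) pairs that collapse to the outer state
SANDWICHES = {("H", "C"), ("C", "H"), ("G", "B"), ("B", "G")}

def edit(segments: List[Segment]) -> List[Segment]:
    """Merge brief segments into their neighbours according to the rules above."""
    out: List[Segment] = []
    i = 0
    while i < len(segments):
        state, dur = segments[i]
        # Rule 1: F followed by a brief E (< 10 msec) becomes one F segment
        if state == "F" and i + 1 < len(segments):
            nxt_state, nxt_dur = segments[i + 1]
            if nxt_state == "E" and nxt_dur < 10:
                out.append(("F", dur + nxt_dur))
                i += 2
                continue
        # Remaining rules: long X, short Y (< 20 msec), long X -> single X
        if i + 2 < len(segments):
            (mid, mid_dur), (last, last_dur) = segments[i + 1], segments[i + 2]
            if (state == last and (state, mid) in SANDWICHES
                    and dur >= 20 and mid_dur < 20 and last_dur >= 20):
                out.append((state, dur + mid_dur + last_dur))
                i += 3
                continue
        out.append((state, dur))
        i += 1
    return out
```

For instance, a 10 msec unvoiced frication segment between two longer voiced frication segments is absorbed, giving one voiced frication segment whose duration is the sum of the three.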




COMPARATIVE SEGMENTATION RESULTS

In July, 1973, a Segmentation and Labeling Workshop was held at the Computer Science
Department of Carnegie-Mellon University, Pittsburgh, Pa. In preparation for this workshop,
certain of the speech groups contracted by ARPA (Advanced Research Projects Agency) submitted
a set of continuous speech utterances typical of the input to their speech understanding systems
under development. At Lincoln Laboratory, a subset of these, 31 utterances in all, were chosen,
prepared, and digitized at 10 kHz and 20 kHz. Analog and digital tapes of these were then made
available to all groups in the speech community at large who wished to submit their results for
segmentation and/or labeling procedures on this set of data. At the workshop, these results were
compared and discussed.

This section describes the results of using time-domain analyses for automatic segmentation of
continuous speech. As previously described, our program looks for specific temporal event
patterns and discontinuities in various time-domain parameters, as indicators of phonetic
boundaries. The output of this program has been compared with the output of four other automatic
segmentation programs presented at this speech workshop (or the improved results subsequently
submitted), on a set of five utterances, one utterance for each of five speakers (4 male, 1 female).

These utterances follow:

1) Do any samples contain tridymite?

2) Count where type equals linear equations and runtime less than five six.

3) I want to do phonemic labeling on sentence six.

4) Display the phonemic labels above the spectrogram.

5) Do you have any rectangular cylinders left?

In Appendix A, there are two different visual displays for each of these utterances. These are 1)
the LIP plots, and 2) the spectrograms (prepared for the workshop).

This comparison test is a particularly strenuous test of robustness because each of the five
programs segments, without any speaker training, utterances spoken by five different speakers
recorded under different environmental conditions. Since the segmentation programs of all the
groups represented are still being actively developed (including this author's program), none of the
segmentation results presented here should be, in any sense, construed as optimized or finalized.
Despite the preliminary nature of all of these results, we feel that since each of these programs has
been run on the same data set, a comparison of these is the best and most appropriate test
presently available. In the discussion which follows, our program results are designated by "TDS"
(time-domain segmentation). The segmentation results of the other groups are designated as "B",
"C", "D", and "E", respectively.

As the first step, we carefully hand-segmented each utterance with the assistance of waveforms,
spectrograms, LIP plots, and records of cycle-by-cycle time-domain parameters. With few
exceptions, our hand segmentation agreed closely with others prepared for the workshop. In this
hand segmentation, all boundaries between acoustically distinct segments were marked. Then the
output of each automatic segmentation was compared with this hand segmentation. Certain
conventions for doing this were established.

Primary boundaries were designated as those comprising the minimal set of boundaries
considered essential for a basic segmentation of the acoustic waveform. First, it was assumed that
one primary boundary exists between every two successive phones. In those instances where sharp
acoustic discontinuities do not occur between two phones (as frequently occurs at the beginnings
and ends of liquids), a best approximation of such a boundary was made. For stop consonants,
two primary boundaries were considered necessary, one designating the pause region and another
at the onset of the release-aspiration region. A primary boundary was also required to designate
the onset of voicing if it lasted for more than 20 msec prior to the high cycle-frequency region of a
fricative or stop consonant. Secondary boundaries mark regions which are acoustically distinct but
are not necessarily related phonetically to the speech stream. Transitional segments between two
successive phones are marked by secondary boundaries. In addition, where there is a long
transition in or out of a phone such that the acoustic characteristics of this transition differ
substantially from the acoustic characteristics prototypic of that phone, a secondary boundary
demarcates this region.

The aim in comparing each automatic segmentation to the hand segmentation is to evaluate
how many primary boundaries were found and by how much time they were displaced from the
hand-segmented boundaries. Where a transitional segment occurs between two phones in the hand
segmentation, the primary boundary between those phones is considered to be that boundary, as
provided by a given automatic segmentation procedure, which is closest to one of the two
boundaries provided by the hand segmentation. An example is shown in the following excerpt of a
single phone, /t/, in a multiple segmentation comparison plot from the first utterance "Do any
samples contain tridymite?" In the multiple segmentation comparison plot which follows in Fig.
40, the segmentations produced by the different programs (TDS, B, C, D, and E) appear on
successive lines under the waveform they segment. Up-arrows denote segmentation markers. In
the hand segmentation shown above the waveform of this /t/ phone, a transition segment is
marked at the end of the release-aspiration. (This transition segment is marked with a double-
headed arrow.) In this example, we denote the beginning marker of this transition segment with
"I", and the end marker as "II". The multiple segmentation comparison plot shows that TDS
places a marker closest to II. Further, we see that B has a marker closest to II, C to I, D to I, and E
to II.


Figure 40: transition segment following /t/ in "contain" (♂, BB)


For each segmentation program, then, the deviation in time is computed from the marker
it provides to the closest marker provided by the hand segmentation, as shown above.

Similarly, if a transition segment is marked in an automatic segmentation, whichever of the
two markers is the closer to the hand segmentation marker is designated as the
primary boundary, and the other as a secondary boundary. In computing the absolute deviation of
automatic segmentation boundaries, only the primary boundaries are considered. When an
automatic segmentation provides for more than one primary boundary where only one boundary
exists in the hand segmentation, the unmatched marker(s) are considered to be "extra".
Automatic segmentations which insert no boundaries within 35 msec of primary boundaries in the hand
segmentation are considered to have missing boundaries. However, if an unmatched boundary is
inserted more than 35 msec from a missing boundary, it is not counted as being an extra boundary.
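
These scoring conventions can be sketched as a small Python routine. It is a simplified reading, not the procedure actually used for the comparison: the function name `score` is hypothetical, hand boundaries are matched greedily in time order, secondary boundaries and transition segments are ignored, and an unmatched automatic marker is counted as extra only when it lies within 35 msec of some hand boundary.

```python
from typing import Dict, List

def score(hand: List[float], auto: List[float],
          tol: float = 35.0) -> Dict[str, object]:
    """Compare automatic markers with hand-placed primary boundaries (msec)."""
    used = set()
    deviations: List[float] = []
    missing = 0
    for h in hand:
        # closest not-yet-used automatic marker within the tolerance
        candidates = [(abs(a - h), i) for i, a in enumerate(auto)
                      if i not in used and abs(a - h) <= tol]
        if candidates:
            dev, i = min(candidates)
            used.add(i)
            deviations.append(dev)
        else:
            missing += 1
    # unmatched markers count as extra only if they lie near a hand boundary
    extra = sum(1 for i, a in enumerate(auto)
                if i not in used and any(abs(a - h) <= tol for h in hand))
    return {"deviations": deviations, "missing": missing, "extra": extra}
```

With hand boundaries at 100, 200, and 400 msec and automatic markers at 103, 110, 190, and 500 msec, the sketch reports deviations of 3 and 10 msec, one missing boundary (at 400), and one extra marker (at 110); the marker at 500 is too far from any hand boundary to count as extra.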

We have another example (Fig. 41) from the first utterance "Do any samples contain
tridymite?". We compare each of the automatic segmentations with the hand segmentation, for
the first three phones of the word "tridymite". Above each of the automatic segmentation
markers in the comparison plot, we have labeled the marker as "p" for primary, "s" for secondary,
or "e" for extra. Areas where a primary boundary is missing are labeled with an "m". Note that
TDS and E designate transitional segments in different ways; TDS uses a double-headed arrow
segment label whereas E uses left-barbed and right-barbed half arrows for marking the transition
segment boundaries.


Figure 41: comparison of some segments in "tridymite" (♂, BB)


An overall tally was computed for each automatic segmentation program, for each utterance,
by separately totalling the number of primary boundaries, secondary boundaries, extra boundaries,
and missing boundaries. The primary boundaries found by each program were further analyzed
within each utterance. These automatically derived primary boundaries, each characterized by its
absolute deviation in time from the hand-segmented boundaries, were separated according to
general phone classes. For example, the primary boundaries which designated the start of vowels
were grouped together, as were those for pauses, stop consonants, liquids, fricatives, nasals, and
voicing regions which precede the voiced fricatives and stop consonants. Then, for each class of
phones, an average was computed of the absolute deviations in time from the hand-segmentation
boundaries.
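
The per-class averaging just described can be illustrated with a short sketch. The record layout `(phone_class, hand_ms, auto_ms)`, the function name, and the sample values are assumptions made for illustration; they are not taken from the study's data files.

```python
from collections import defaultdict


def mean_abs_deviation_by_class(matches):
    """Average absolute deviation (msec) and count for each phone class.

    matches: iterable of (phone_class, hand_ms, auto_ms), one entry per
             primary boundary that the automatic program found.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for phone_class, hand_ms, auto_ms in matches:
        sums[phone_class] += abs(auto_ms - hand_ms)
        counts[phone_class] += 1
    return {c: (sums[c] / counts[c], counts[c]) for c in sums}


# Hypothetical matched boundaries.
matches = [
    ("stop", 100.0, 101.0),
    ("stop", 300.0, 298.0),
    ("vowel", 150.0, 156.0),
]
table = mean_abs_deviation_by_class(matches)
```

Each entry pairs the mean deviation with the number of boundaries it was averaged over, mirroring the "deviation(count)" layout of the summary table.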




A summary table (Fig. 42) combines the results from all the utterances, for each
segmentation program individually. The total number of phones found in each class is designated
in parentheses following the absolute time deviations (expressed in milliseconds) of the primary
boundaries in each class. In addition, we have computed the accuracy of each segmentation
program in finding the end primary boundaries of fricatives and stop consonants. In Appendix B,
we present separate analyses for each of the five utterances.


ABSOLUTE DEVIATION (MSEC) OF PRIMARY BOUNDARIES BY PHONE CLASS

PROGRAMS   PAUSES     STOPS      VOWELS     LIQUIDS    FRICATIVES   NASALS     VOICING
TDS         2.3(24)    1.3(35)    6.0(49)    7.0(18)    4.7(23)      1.6(25)    2.0(22)
B          11.6(13)    8.6(26)   11.4(55)    7.1(21)   10.8(21)     11.7(22)   11.2(18)
C           8.3(9)     9.1(13)   11.7(43)    8.3(8)    13.5(21)      8.2(18)    6.0(14)
D*          7.0(14)    4.5(20)    9.1(40)   10.2(14)   11.7(13)      7.8(17)   10.0(12)
E          10.7(18)   11.0(20)   12.7(49)   12.5(20)   14.2(19)     11.6(17)    5.6(17)

           PRIMARY END BOUNDARIES     SECONDARY BOUNDARIES
PROGRAMS   STOPS      FRICATIVES      # FOUND
TDS         5.1(24)    5.0(22)        46
B           9.0(25)   11.6(22)         9
C           8.1(19)   12.9(16)         4
D           8.1(18)    9.6(14)         6
E          11.8(24)   14.1(23)         9

           PRIMARY BOUNDARIES FOUND               AVERAGE ABSOLUTE
PROGRAMS   # found   percentage (total = 216)     DEVIATION (msec)
TDS        194       91%                           3.5
B          176       81%                          10.6
C          126       58%                          10.1
D*         130       86%                           8.8
E          160       74%                          11.5

           EXTRA BOUNDARIES FOUND             MISSED PRIMARY BOUNDARIES   MISSED PLUS EXTRAS
PROGRAMS   # found   percentage (total=216)   # missed   percentage       # total   percentage
TDS        38        18%                      20          9%              58        27%
B          43        20%                      40         19%              83        38%
C           9         4%                      90         42%              99        46%
D          51        34%                      22         14%              73        48%
E          32        15%                      56         26%              88        41%

PROGRAMS   PERCENTAGE STOPS FOUND   PERCENTAGE FRICATIVES FOUND
           (total = 36)             (total = 24)
TDS        97%                      96%
B          72%                      88%
C          39%                      88%
D          83%                      81%
E          61%                      79%



There are several important aspects to consider while comparing the results of these five
different automatic segmentation programs. Segmentation programs are designed to produce
segment markers when they detect changes from one acoustic state to another. The amount of
acoustic change necessary for triggering detection of a new acoustic state is generally determined
by one or more thresholds.

Experience has shown that with many segmentation programs, systematically varying the
threshold value(s) causes results which range all the way from leaving out very few real boundaries
but inserting many extra boundaries, to missing many of the real boundaries but adding very few
extra ones. Depending on how a given program is to be used, these thresholds must be set to
minimize segmentation errors of certain kinds. For example, a segmentation program might be
designed chiefly to locate certain easily recognizable acoustic states ("islands of reliability"). In
this case, very conservative thresholds might be used to obtain a segmentation which accurately
locates most of the best defined boundaries, misses many of the less distinct boundaries, and
inserts very few, if any, extra boundaries. However, another segmentation program which is
designed to locate as many real acoustic boundaries as possible would probably set lower
thresholds for locating new acoustic states. Consequently this might produce results where most of
the true acoustic boundaries are detected (though less distinct boundaries might not be so
accurately determined in time as sharper boundaries) and therefore very few boundaries are
missing, but a great many extra boundaries might also be inserted. In short, there exists a trade-off
between the number of missing boundaries and the number of extra boundaries.
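
The trade-off can be made concrete with a toy threshold sweep. The sketch below is illustrative only: the candidate scores, the `count_errors` helper, and the thresholds are invented, and a real segmenter would derive its change scores from acoustic parameters rather than from a fixed list.

```python
# Each candidate location has an acoustic-change score and a flag saying
# whether a real boundary exists there. All values are made up.
candidates = [
    (0.9, True), (0.8, True), (0.6, False), (0.5, True),
    (0.4, False), (0.3, True), (0.2, False), (0.1, False),
]


def count_errors(candidates, threshold):
    """Return (missing, extra) when markers are placed where score >= threshold."""
    missing = sum(1 for score, real in candidates if real and score < threshold)
    extra = sum(1 for score, real in candidates if not real and score >= threshold)
    return missing, extra


# Low, middle, and conservative thresholds.
sweep = {t: count_errors(candidates, t) for t in (0.15, 0.45, 0.75)}
```

With these numbers the lowest threshold misses nothing but inserts three extra markers, while the most conservative one inserts none but misses two real boundaries.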

Previously, we have described the kinds of sharp discontinuities which exist at most phone
boundaries, and certain temporal patterns of acoustic changes characteristic of fricatives and stop
consonants. Our segmentation program incorporates a large subset of these features. In particular,
we aimed for detecting most of the primary boundaries and locating them as accurately as possible,
especially the stop consonants, for which accurate location is most crucial.

As indicated in the preceding table, the TDS program detected more boundaries (91%) than
the other programs, and located these 250% - 329% more accurately when the absolute
deviations for all the phones were averaged together. Only C had fewer extra boundaries (4%) than
TDS (18%), but on the other hand, it also missed 42% of the primary boundaries as compared to
9% for TDS. This is an example of the missing-extra boundary trade-off. However, TDS was also
able to find the greatest number of fricatives, stop consonants, and secondary boundaries, as well
as pauses, nasals, and voicing regions.

Detecting the release-aspiration segments of stop consonants is very important because they
last for such a short time. The average duration for the release-aspiration region of stop
consonants in all five utterances was 25.1 msec, with several shorter than 5 msec. Fricatives, by
contrast, averaged 79.9 msec in length. TDS typically located the start of release-aspiration with
an error of 1.3 msec. As compared with the other segmentation programs, the TDS boundaries were
about 350% to 850% more accurate. Although the absolute magnitude of the other programs' errors
was small, from 4.5 msec to 11.0 msec, these errors were large relative to the release-aspiration
durations. After these errors were further convolved with several milliseconds more of error in
determining the stop consonant end boundaries, it is no surprise that many stop consonants were
missed altogether by the other programs. For the other segmentation programs, the sum of beginning
boundary and end boundary errors typically averaged 2/3 of the release-aspiration duration. This
means that if these segments are used either for providing acoustic templates or training for
acoustic recognition programs, or are used for comparison against other templates, good recognition
results are doubtful. As for other classes of phones, the average error of TDS in locating
fricatives was 4.7 msec, for pauses 2.3 msec, for nasals 1.6 msec, and for voicing regions 2.0 msec.
Although nasals and voicing regions are usually acoustic steady states, they begin abruptly, are
well characterized throughout, and end abruptly. As can be seen from Fig. 42, although programs B
and E found several more liquids and/or vowels than TDS, the TDS program, at its worst overall, was
as accurate or more accurate than all of the other programs in locating liquid and vowel primary
boundaries.
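
The "2/3 of the release-aspiration duration" figure and the 350% to 850% range can be checked against the stop-consonant entries of the summary table. The following is a worked-arithmetic sketch; the variable names are ours, and it assumes the start and end errors are simply added and then averaged over programs B, C, D, and E.

```python
RELEASE_ASPIRATION_MS = 25.1  # average duration quoted in the text
TDS_START_MS = 1.3            # TDS start-of-stop error from the summary table

# (start error, end error) in msec for stop consonants, from the summary table
stop_errors = {
    "B": (8.6, 9.0),
    "C": (9.1, 8.1),
    "D": (4.5, 8.1),
    "E": (11.0, 11.8),
}

# Combined boundary error as a fraction of the release-aspiration duration.
fraction = {p: (s + e) / RELEASE_ASPIRATION_MS for p, (s, e) in stop_errors.items()}
mean_fraction = sum(fraction.values()) / len(fraction)

# Start error of each program relative to that of TDS.
start_ratio = {p: s / TDS_START_MS for p, (s, e) in stop_errors.items()}
```

The mean fraction comes out near 0.70, and the start-error ratios run from roughly 3.5 (program D) to roughly 8.5 (program E).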

Another issue which arises is the robustness or consistency of segmentation procedures across
different speakers and recording conditions. For purposes of comparison, tables of absolute
deviations by phone classes are presented for each utterance in Appendix B, as previously
mentioned. The programs are not always consistent. We observe, for example, that for utterance #3,
C accumulated the lowest total of missing plus extra boundaries, only 6 in all, whereas in
utterance #4, they accumulated the highest total of missing plus extra boundaries (14 in all
missing, 3 extra). It should be noted
though, that utterance #4 was the only utterance spoken by a female (I.M). Since male voices
have been used almost exclusively for acoustic speech research, the male speech characteristics
incorporated into some segmentation and labeling programs may preclude good performance on
female voices. If the fourth utterance is not included for C, their score at finding primary
boundaries for all the utterances combined increases from 58% to 66%, and for detecting stop
consonants, it increases from 39% to 41%. Segmentation results for D were not available either
for the last half of the third utterance or the entire fourth utterance. Therefore their percentage
scores are computed proportionately to the material for which their segmentation results were
available. Hand segmentation shows the material which D completely segmented to contain a
total of 152 primary boundaries, including 16 fricatives and 24 stop consonants.
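
As a check on the proportional scoring for program D, its percentages can be recomputed against the reduced totals (152 primary boundaries, 24 stops, 16 fricatives) instead of the full-corpus totals. This sketch assumes the bracketed counts in the summary table are the numerators; the `percent` helper and the dictionary keys are illustrative names.

```python
def percent(count, total):
    """Percentage rounded to the nearest whole number."""
    return round(100.0 * count / total)


# Totals for the material that D actually segmented.
D_TOTALS = {"primary": 152, "stops": 24, "fricatives": 16}

d_scores = {
    "primary found": percent(130, D_TOTALS["primary"]),
    "primary missed": percent(22, D_TOTALS["primary"]),
    "extra": percent(51, D_TOTALS["primary"]),
    "stops found": percent(20, D_TOTALS["stops"]),
    "fricatives found": percent(13, D_TOTALS["fricatives"]),
}
```

These reproduce the 86%, 14%, 34%, 83%, and 81% entries listed for D.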


Chapter IV - ALLOPHONES OF FRICATIVES AND STOP CONSONANTS

THE  PROBLEM 

Our study of allophones is prompted quite simply by the fact that the acoustic manifestation
of any phone is, to varying degrees, a function of the context or acoustic environment in which it
occurs. As a consequence, we must understand these coarticulation phenomena in order to
discriminate well between phones of the same class, among the stop consonants, for example. We
use the term "allophone" to designate a phone embedded in a certain kind of environment, with
respect to post- and/or pre-context. For example, the allophones of /k/ include those which are
nasalized, retroflexed, rounded, and so forth. In general, it has been recognized that the context
following a given phone exerts a greater influence on that phone's acoustic manifestation than
does its preceding context.

An extreme example of contrast in the acoustic characteristics of a phone, due to context
differences, is demonstrated by a comparison of Figs. 43 and 44. Here the two allophones to be
compared are the unrounded /k/ and the rounded /k/, in the words "king" and "queen",
respectively. The utterance of Fig. 43 is "Pawn to king four", and the utterance in Fig. 44 is
"Pawn to queen four". In these figures, it is apparent that the average cycle-frequency of the
release-aspiration of the /k/ in "queen" is substantially lower than that of the /k/ in "king". This
is due to the rounding of the lips, which effectively lengthens the vocal tract, thereby lowering the
frequencies emitted. In addition, we know that for both /k/ and /g/, in general, the precise place
of articulation varies somewhat as a consequence of coarticulatory factors. Although it comes as
no surprise that different allophones of the same phone have different acoustic characteristics, it
does create certain problems. It means that the ability to recognize a given phone in one context is
not generalizable to recognition of that phone in other contexts. In perception experiments,
Schatz [S2] has demonstrated that context is vital for proper identification of voiceless stop
consonants. Given almost any parameterization, a great deal of overlap between similar phones is
unavoidable. The intra-phone differences (that is, the differences between allophones of a given
phone) are nearly as great as the inter-phone differences. Therefore studies are required on the
acoustic variations of allophones. In the past, this work has been mostly concerned with
frequency-domain examination of vowel allophones [S4]. And much of this has been centered on
analyses of vowel formant trajectories [S5].
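
As a rough illustration of why lip rounding lowers the emitted frequencies, the quarter-wavelength resonance of a tube shows how a longer tube resonates lower. This is a textbook idealization and not the analysis used in this study: it assumes a uniform tube closed at one end, and the tube lengths, the 2 cm lengthening, and the speed of sound value are hypothetical.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # approximate value for warm, moist air


def lowest_resonance_hz(tube_length_cm):
    """Quarter-wavelength resonance of a uniform tube closed at one end."""
    return SPEED_OF_SOUND_CM_S / (4.0 * tube_length_cm)


unrounded = lowest_resonance_hz(17.0)
rounded = lowest_resonance_hz(19.0)  # hypothetical 2 cm lengthening from rounding
```

In this idealization the longer tube has the lower resonance, which is the direction of the shift described for the /k/ of "queen".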




AN  ALLOPHONE  EXPERIMENT 

Since time-domain techniques are particularly suitable for the acoustic characterization of
fricatives and stop consonants, we have chosen to study the allophones of these. Shoup has
generously made available to us audio tape recordings of 3 speakers, all linguists (2 male, 1
female), each reading 225 utterances encompassing both the common and rare allophones of the
fricatives and stop consonants in general English. Details on the specific allophone designations,
including phonetic transcriptions for all of these, have previously been published by Shoup [S3].
Each of these utterances is in citation form, e.g. "no foe", "lube moves", with the allophone of
interest embedded in a suitable context. The allophones enumerated for the fricatives include
nasalization, retroflexion, dentalization, and rounding, and for the stop consonants include
aspiration, nasalization, retroflexion, palatalization, dentalization, and rounding, as well as the
minimally released form. For both phone classes, the unmodified allophone is also included; this
generally occurs in a neutral context of high front vowels. Not all allophones are possible for all
phones. For the stop consonants, dentalization is restricted to /t/ and /d/, while palatalization is
restricted to /g/ and /k/. All other allophone forms mentioned above are possible for all the
stops.

We digitized the data, as previously described, and used cycle-by-cycle parameters in order
to ascertain precisely the beginning and end points of each allophone as well as the boundaries of
all acoustically distinct subphonetic segments, such as transition cycles. Exact segmentation is
crucial for computing the best possible acoustic characteristics of the segments themselves. This
work required the parameters of each cycle of the waveform to be examined, individually, by
hand. We performed this segmentation chiefly on the basis of the cycle-by-cycle time-domain
parameters, along with the aid of UP plots, waveforms, and listening to the audio tapes. The times
of the individual acoustic segment boundaries within each allophone were noted on the initial
up-crossing of the first cycle of each acoustic segment. For fricatives, these acoustic segments
included voicing regions, unvoiced frication, voiced frication, and transition regions (both
individual transition cycles as well as more gradual extended transitions) in and out of frication
regions. For stop consonants, these included stop preview transitions, stop previews, voicing
regions, pause cycles, release-aspiration, and transitions (both individual transition cycles as well
as longer transitions) at the conclusion of release-aspiration. As previously described in Chapter
II, these acoustic segments may be ascertained from abrupt changes of certain kinds occurring in
one or more of the cycle parameters. The cycle parameters we examined for this detailed hand
segmentation consisted of cycle-frequency, cycle-amplitude, and microstructure. After this work
was completed, all the acoustic segment boundaries had to be entered into files. These files were
then used to direct a battery of statistical tests to be performed on each of the acoustic segments
specified.
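
Since segment times are referenced to up-crossings, a short sketch may help show how cycle boundaries and a per-cycle frequency could be read from a sampled waveform. It assumes that a cycle is the span between successive negative-to-positive zero crossings; the function names and the synthetic 100 Hz test tone are illustrative and are not the analysis programs described in this report.

```python
import math


def upcrossing_indices(samples):
    """Indices i where the waveform goes from below zero to zero or above."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] < 0.0 <= samples[i]]


def cycle_frequencies(samples, sample_rate_hz):
    """One frequency per cycle, a cycle being the span between up-crossings."""
    ups = upcrossing_indices(samples)
    return [sample_rate_hz / (b - a) for a, b in zip(ups, ups[1:])]


# Synthetic 100 Hz sine sampled at 10 kHz; the small phase offset keeps
# samples from landing exactly on zero.
rate = 10000
wave = [math.sin(2 * math.pi * 100 * n / rate - 0.1) for n in range(500)]
freqs = cycle_frequencies(wave, rate)
```

For the test tone every cycle spans 100 samples, so each cycle-frequency comes out at 100 Hz.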

The results could be used to compare the acoustic characteristics of like segments for all the
allophones. And such quantitative results may then address a number of issues, such as:

1) What acoustic differences exist generally between /p/, /t/, and /k/, or between the low
energy fricatives /v/ and /;/?

2) How may a retroflexed /t/ be distinguished from a non-retroflexed /t/?

3) Which has the greater effect on /s/ duration, nasalization or rounding?

Another important question we may ask is whether specific allophone effects are consistent
from one phone to another in the same phone class. For example, does a nasalized /t/ as
compared to a non-nasalized /t/ bear a similar relationship to a nasalized /p/ compared with a
non-nasalized /p/? In short, are there regular acoustic transformations which characterize the
nature of the different allophone classes themselves? With respect to vowels, similar kinds of
questions and methodology were adopted by Gerstman [G1], who found that classification of a
single speaker's vowels in two-formant space could be accomplished given only the first two
formants of 2 or 3 known referent vowels, and the first two formants of each of the others.

We have performed pair-wise phone recognition tests, based on the allophone effects derived
from single instances of allophones from the combined set of /b/, /p/, /g/, and /d/ by the same
speaker. These recognition tests compared 6 measures of the release-aspiration segments of the
stop consonants. For the comparison of fricatives, the same 6 measures were used to characterize
frication acoustic segments. Except for a duration measure of the acoustic segment itself, the
other measures were averages of the individual cycle parameters observed during the course of the
whole acoustic segment. The 6 measures utilized were 1) cycle-frequency, 2) cycle-frequency
dispersion, 3) duration, 4) total variation, 5) microstructure, and 6) absolute amplitude. The
cycle-frequency dispersion measure has not been previously mentioned; it is simply the standard
deviation of all the cycle-frequency values observed within a given acoustic segment. The value of
this measure is, of course, higher for a voiced frication segment, for example that of /z/, as
compared to the unvoiced frication segment of its counterpart /s/. In the former case, there are
voicing cycles with much lower cycle-frequency values mixed in with high cycle-frequency values,
and in the latter case only high cycle-frequency values are observed.
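
The dispersion measure is easy to illustrate. The sketch below assumes the sample standard deviation (dividing by n - 1); the text says only "standard deviation", so that choice, the function name, and the cycle-frequency values are all assumptions made for illustration.

```python
import statistics


def cycle_frequency_dispersion(cycle_freqs_hz):
    """Standard deviation of the cycle-frequency values in one acoustic segment."""
    return statistics.stdev(cycle_freqs_hz)


# Illustrative values only, not measured data.
unvoiced_s = [4800.0, 5100.0, 4950.0, 5200.0, 5000.0]  # only high values
voiced_z = [4900.0, 150.0, 5100.0, 140.0, 5000.0]      # voicing cycles mixed in

disp_s = cycle_frequency_dispersion(unvoiced_s)
disp_z = cycle_frequency_dispersion(voiced_z)
```

The mixture of low voicing cycles and high frication cycles gives the /z/-like segment a far larger dispersion than the /s/-like one.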

For these tests, rare and occasionally occurring allophones were excluded because these
allophones were generally unfamiliar to the speakers themselves, and were therefore prone to
articulation errors. In addition, since many of them applied only to a single phone or to a single
voiced-unvoiced phone pair, they could not be used for general phone comparisons. In addition,
release-aspiration segments containing a total of 3 or fewer cycles were excluded from
consideration due to the unreliability of such small sample sizes in standard deviation
computations and so forth.

For the stop consonants, the referent allophone against which all the other allophones were
compared was the aspirated form of each phone. For the fricatives, the referent used for each
phone was the unmodified form. For each speaker then, we computed, for each of the 6
measures of each allophone, its ratio with the corresponding measure of the appropriate referent
allophone.

Given the hypothesis that, for example, the release-aspiration portions of the stop consonants
are inherently acoustically characterized by such measures as we have chosen, and that regular
acoustic transformations of allophones exist and may be expressed relative to the individual phone
referents, we performed the following experiment. For a given speaker, we have 1 instance each
of the aspirated allophones of /g/, /k/, /b/, /p/, /d/, and /t/. These are our referents. As
previously described, we compute for each measure of each of the other allophones the ratio
$R_{m,p}(i)$ of its value, $V_{m,p}(i)$, to the corresponding value of its referent allophone,
$V_{0,p}(i)$.

1) $R_{m,p}(i) = V_{m,p}(i) / V_{0,p}(i)$

where $V_{m,p}(i)$ = value of parameter i for allophone m of phoneme p
and allophone m = 0 represents the referent value.
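
Equation 1 amounts to dividing each measure of an allophone by the same measure of its referent. A minimal sketch follows; the dictionary layout, the function name, and the numbers are hypothetical.

```python
def allophone_ratios(referent, allophone):
    """R(i) = V_allophone(i) / V_referent(i) for each measure i."""
    return {name: allophone[name] / referent[name] for name in referent}


# Hypothetical measurements for one speaker's /t/.
aspirated_t = {"cycle_frequency": 4000.0, "duration_ms": 40.0}  # referent
nasalized_t = {"cycle_frequency": 3000.0, "duration_ms": 30.0}

ratios = allophone_ratios(aspirated_t, nasalized_t)
```

Because the ratios are dimensionless, measures with different units (Hz, msec) can be pooled across phonemes in the steps that follow.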

If the hypothesis described above is correct, we hope to discriminate with distance measures
between 2 phones, regardless of which allophone of these two is being tested, solely on the basis of
knowing 1) the aspirated referents for each of the 2 test stop consonants, and 2) the average
relationship of each allophone type, as derived from the combined allophone ratios of the training
set of the other 4 phonemes, not being tested.

Therefore:

2) $A_m(i) = \sum_{p=1}^{q} R_{m,p}(i) / q$

where q = number of phonemes in training set.

3) $\sigma_m(i) = \left( \sum_{p=1}^{q} (R_{m,p}(i) - A_m(i))^2 / (q-1) \right)^{1/2}$

We may then predict a prototype value of parameter i for a given allophone type, for any
phoneme for which we have a referent value of parameter i.

4) $P_m(i) = V_{0,q+1}(i) \, A_m(i)$

where phoneme p = q + 1 represents the phoneme for which the predicted prototype is to be
computed.
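
Equations 2 through 4 can be sketched for a single parameter and a single allophone type as follows. The function names and the training ratios are hypothetical; the sketch assumes q = 4 training phonemes, as in the stop consonant tests.

```python
import math


def average_ratio(ratios):
    """Equation 2: mean allophone ratio over the q training phonemes."""
    return sum(ratios) / len(ratios)


def ratio_std(ratios):
    """Equation 3: standard deviation of the ratios, dividing by q - 1."""
    a = average_ratio(ratios)
    return math.sqrt(sum((r - a) ** 2 for r in ratios) / (len(ratios) - 1))


def predict_prototype(referent_value, ratios):
    """Equation 4: referent value of the test phoneme times the average ratio."""
    return referent_value * average_ratio(ratios)


# Hypothetical duration ratios (nasalized / aspirated) for 4 training phonemes.
training_ratios = [0.70, 0.80, 0.75, 0.75]
a_m = average_ratio(training_ratios)
sigma_m = ratio_std(training_ratios)
p_m = predict_prototype(40.0, training_ratios)  # referent duration of 40 msec
```

Here a test phoneme whose aspirated referent lasts 40 msec would be predicted to last about 30 msec in its nasalized form.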

Finally we take a modified Euclidean distance measure between each of the predicted
allophone parameter values and the test phoneme sample parameter values

where $P_m(i)$ = predicted prototype value of parameter i
$S_{m,p}(i)$ = test phoneme sample value of parameter i
$\sigma_m(i)$ = standard deviation of parameter i
$T_m$ = distance measure of allophone m of selected phoneme.
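
One plausible reading of a "modified Euclidean distance" is a Euclidean distance in which each parameter difference is scaled by its standard deviation. The sketch below adopts that reading as an assumption, with the spread expressed in the same units as the parameter; it should not be taken as the exact formula used in these tests. The decision rule (the phoneme owning the nearest predicted allophone prototype wins) follows the description in the text, while the function names, prototypes, and spreads are hypothetical.

```python
import math


def normalized_distance(predicted, sample, sigma):
    """Euclidean distance with each parameter difference scaled by its sigma."""
    return math.sqrt(sum(((predicted[k] - sample[k]) / sigma[k]) ** 2
                         for k in predicted))


def choose_phoneme(sample, prototypes, sigma):
    """Return the phoneme whose predicted allophone prototype is nearest.

    prototypes: {(phoneme, allophone_type): {parameter: predicted value}}
    sigma:      same keys, giving the assumed spread of each parameter
    """
    best = min(prototypes,
               key=lambda key: normalized_distance(prototypes[key], sample, sigma[key]))
    return best[0]


# Hypothetical predicted prototypes for a /p/ vs /t/ test.
prototypes = {
    ("p", "nasalized"): {"cycle_frequency": 1500.0, "duration_ms": 20.0},
    ("p", "rounded"): {"cycle_frequency": 1200.0, "duration_ms": 25.0},
    ("t", "nasalized"): {"cycle_frequency": 3500.0, "duration_ms": 30.0},
    ("t", "rounded"): {"cycle_frequency": 3000.0, "duration_ms": 35.0},
}
sigma = {key: {"cycle_frequency": 300.0, "duration_ms": 5.0} for key in prototypes}
unknown = {"cycle_frequency": 3300.0, "duration_ms": 31.0}
chosen = choose_phoneme(unknown, prototypes, sigma)
```

Scaling by the spread keeps a parameter measured in hertz from dominating one measured in milliseconds.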

With this measure, chance predicts 50% correct choices and 50% incorrect choices in
choosing one of the two test phonemes. Note that this test paradigm is very strenuous for several
reasons. For each speaker, only one example of each allophone exists in the data base. Of
necessity, the sample space from which to derive allophone effect statistics is very limited.
Therefore, in discriminating between any 2 stop consonants, only 4 samples (one from each of the
other stop consonants), at best, are available for providing the derivation or training for each of
the different allophone characteristics.

Clearly, any errors in this training set may assume a large effect. In the original allophone
classification of the data, the assumption was made that if a phone occurred in a given
environment, it was necessarily coarticulated with that environment, and therefore was labeled as
such. However, it is not necessary that a phone be coarticulated with its environment; it may, and
in fact often does, remain unmodified by its environment. Nevertheless its allophone label is the
same as though it were coarticulated. For example, in the utterance "sip more", the /p/ is labeled
in the data as being a nasalized allophone of /p/, whether it is acoustically nasalized or whether
it is acoustically the same as the unmodified allophone of /p/ in the utterance "sip it".
Therefore, it is very likely the case that unmodified allophones are averaged in with modified
allophones in the training sets for the various kinds of modified allophones, and contribute a
source of error.

The results of this experiment, however, are surprisingly good. We obtained scores for the
tests of pair-wise phone discrimination, both for phones with same and different places of
articulation. Of particular interest are the results for discrimination of phones with different places
of articulation, such as between /p/ and /t/, for example. Therefore in the tests for all allophones
of 1) /g/ and /k/ vs the allophones of /b/, /p/, /d/, and /t/, 2) /b/ and /p/ vs /g/, /k/, /d/,
and /t/, and 3) /d/ and /t/ vs /g/, /k/, /b/, and /p/, the following scores were obtained:

Different  Place  of  Articulation 


Male 1 - 75%
Male 2 - 86%
Female 1 - 71%

These tests assume that no knowledge is available concerning the acoustic environment in
which an unknown phone occurs. This is, in fact, hardly ever the case in speech segmentation
and/or labeling processes. Recall that in these tests, an unknown allophone, belonging to one of
the two phonemes tested, is being compared against a predicted set of parameter values for each
allophone of both test phonemes, and that the specific allophone, characterized by these predicted
values, with the shortest distance measure from the unknown allophone determines the test
phoneme chosen by each test. With speech processing programs, it is very often the case that
information is available concerning the nature of the actual context of an unknown stop
consonant, for example. In all likelihood, given such information, the probability of confusing
similar stop consonants would be diminished. This kind of redundancy could prove to be quite useful
for accurate labeling. For example, if in analyzing an unknown stop consonant, the distance measures
for a nasalized /p/ and a rounded /k/ were close, one could look for contextual evidence of a
nasal to help resolve labeling confusion. This kind of contextual test is generally not difficult,
since it is not necessary that the contextual phones be actually identified, but only their general
phone classes. However, it is also likely that for optimal labeling, accurate statistics are required
on individual speaker coarticulation phenomena.

Next, we examine the less critical discrimination of stop consonants with the same place of
articulation; specifically, /g/ vs /k/, /b/ vs /p/, and /d/ vs /t/. Although all the voiced stops
were preceded by voicing, we did not use this important cue for discrimination, but rather
compared only the release-aspiration segments.

The scores obtained for our three speakers follow:

Same Place of Articulation

Male 1 -
Male 2 - 69%
Female 1 - 44%


These much lower discrimination scores indicate the high degree of acoustic similarity between
the release-aspirations of voiced-unvoiced phone pairs. In many cases, a voiced stop goes
de-voiced at or very near the start of its release-aspiration. After that point, the release-aspiration
of the voiced stop is very similar to that of the unvoiced phone with the same place of articulation.
However, it is also true that certain consistent differences generally do prevail between voiced
phones and their unvoiced counterparts, the shorter duration of the voiced phones, for example.
In addition, we note that the discrimination scores for the female speaker were lower than those of
the two male speakers. This may be due to the higher ambient noise level in the recordings of her
voice.

We performed the same pair-wise phone recognition tests among the 8 fricatives previously
enumerated. For the fricatives then, the training set consisted of all the allophone data of 6
fricatives. The scores follow for the pair-wise discrimination of fricatives with different places of
articulation; that is, 1) /s/ and /z/ vs /ʃ/, /ʒ/, /θ/, /ð/, /f/, and /v/, 2) /ʃ/ and /ʒ/ vs /s/, /z/,
/θ/, /ð/, /f/, and /v/, 3) /θ/ and /ð/ vs /s/, /z/, /ʃ/, /ʒ/, /f/, and /v/, and 4) /f/ and /v/ vs
/s/, /z/, /ʃ/, /ʒ/, /θ/, and /ð/.

Different  Place  of  Articulation 


Male 1 - 95%

Male  2 - 91% 

Female  1 - 83% 

As for the stop consonants, contextual information was not used for these phoneme
discrimination tests. In actual practice, we expect that use of such generally available information
would improve phoneme discrimination or labeling even further. Discrimination scores for fricatives
with the same place of articulation; that is, /s/ vs /z/, /ʃ/ vs /ʒ/, /θ/ vs /ð/, and /f/ vs /v/, as
expected, are somewhat lower.

Same  Place  of  Articulation 


Male 1 - 83%
Male 2 - 75%
Female 1 - 75%


We note that, on the whole, the discrimination scores for all the speakers, both for phonemes
with the same and those with different places of articulation, are somewhat higher for the fricatives
than for the stop consonants. This may be due in part to increased stability with the larger set of
training data for the fricatives, with 6 phonemes rather than 4. Another contributing factor for the
difference may be the fact that, on the average, the duration of frication is substantially longer
than that for the release-aspiration of stops. And statistics gathered over longer duration segments
may be more stable. Specifically, in examining the release-aspiration segments of all the stop
consonant allophones, we find that 10% of them last for less than 5 msec in duration, and 8% of
them last for less than 3.3 msec (usual wide band spectrogram resolution) in duration. In normal
continuous speech, these percentages are probably higher yet. On the other hand, only 1% of the
frication segments were less than 10 msec in duration.


This table summarizes all the results just presented.


Summary of Allophone Tests

                      Male 1                   Male 2           Female 1

Stops:
  different place     [illegible]              [illegible]      71% (49/69)
  same place          [illegible]              69% (11/16)      44% (7/16)

Fricatives:
  different place     95% (117/[illegible])    91% (118/129)    83% (100/120)
  same place          83% (20/24)              75% (15/20)      75% (15/20)
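
Each score in the summary is the number of correct pair-wise decisions divided by the number of
trials. The following minimal sketch redoes that arithmetic for the entries whose counts can be
read from the table. The function name `score_percent` and the dictionary labels are ours,
introduced only for illustration; they do not come from the report.

```python
def score_percent(correct: int, total: int) -> int:
    """Return a discrimination score as a whole-number percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total)


# (correct, total) counts as read from the summary table.
legible_counts = {
    "stops, different place, Female 1": (49, 69),
    "stops, same place, Male 2": (11, 16),
    "stops, same place, Female 1": (7, 16),
    "fricatives, different place, Male 2": (118, 129),
    "fricatives, different place, Female 1": (100, 120),
    "fricatives, same place, Male 2": (15, 20),
}

for label, (correct, total) in legible_counts.items():
    print(f"{label}: {score_percent(correct, total)}%")
```

Rounding to the nearest whole percent is an assumption about how the report's figures were
produced.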


In Appendix C there are matrices presented which show, for each speaker, the specific
pair-wise phone comparison test results, with the results combined for different and same place of
articulation.


The following tables illustrate the significance of using allophone information for identifying
phonemes accurately. Taking each parameter in turn, we show for each of the three speakers, the
relationships of the phonemes characterized by these parameters. First we give the range of values
assumed by each of the phonemes. Clearly there is a great deal of overlap among these phoneme
ranges. However, if we compute the average ratio of each allophone type to its phone referent, we
find that the different allophone types each adhere, rather consistently, to specific regions within
these ranges. Therefore the search space for a given allophone is significantly narrowed. So next
we present the allophone ratios for all the phones of all the speakers combined. Note that the
separation of these allophone ratios for the fricatives is much smaller than for the stop consonants.
Separate analyses of averages and standard deviations for all the allophone types, for each speaker,
are included in the Appendix. Note also that although, for a given parameter, two allophone types
may occupy similar positions, for another parameter, these allophone types may occupy distinctly
different positions. It is the total pattern of the different parameter relationships which
characterizes any given allophone. In these tables, the following abbreviations are used for the
different allophone types:


-  unmodified
A  palatalized
C  minimally released
H  aspirated
N  nasalized
R  retroflexed
T  dentalized
W  rounded
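
One way to read "the average ratio of each allophone type to its phone referent" is as the mean
parameter value for an allophone type divided by the mean for the unmodified phone, expressed as a
percentage. The sketch below assumes exactly that reading. The measurement values, the function
name `allophone_ratio_percent`, and the choice of the unmodified type ("-") as referent are
illustrative assumptions, not taken from the report.

```python
from statistics import mean


def allophone_ratio_percent(samples, referent="-"):
    """Map each allophone-type code to 100 * mean(type) / mean(referent).

    samples: dict from allophone-type code to a list of measurements of
    one parameter for one phone.
    """
    base = mean(samples[referent])
    if base == 0:
        raise ValueError("referent mean must be non-zero")
    return {code: 100.0 * mean(values) / base for code, values in samples.items()}


# Hypothetical durations (msec) for one phone, grouped by allophone type.
durations = {
    "-": [40.0, 44.0, 36.0],  # unmodified, mean 40
    "H": [62.0, 58.0],        # aspirated, mean 60
    "N": [30.0, 34.0],        # nasalized, mean 32
}

ratios = allophone_ratio_percent(durations)
# List the types in ascending order of ratio, as the tables do.
for code, pct in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{code} {pct:.0f}")
```

Under this reading the referent type always comes out at 100, and the other types fall above or
below it.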

STOP CONSONANTS

PHONE  CYCLE-FREQ        CF DISP           DURATION          TV(NORMALIZED)  MICROSTRUCTURE  ABSAMP
       P #1 (Hz)         P #2              P #3 (msec)       P #4            P #5            P #6
       min-max           min-max           min-max           min-max         min-max         min-max

Male #1

/g/    [illegible]       622-1702          13.2-[illegible]  317-890         73-315          33-97
/k/    [illegible]       [illegible]-1700  23.2-79.3         536-1233        95-291          99-136
/b/    996-[illegible]   103-1119          3.0-6.0           553-779         69-170          93-67
/p/    1663-[illegible]  93-1139           2.9-33.9          618-790         66-186          55-81
/d/    1501-3778         [illegible]       1.3-35.1          569-1193        116-269         13-119
/t/    1636-[illegible]  [illegible]-3352  10.9-101.2        676-1953        [illegible]     22-132

Male #2

/g/    [illegible]       121-962           [illegible]       342-830         22-123          55-104
/k/    1251-2610         [illegible]-469   1.8-154.6         414-834         31-78           32-214
/b/    [illegible]-2180  478-673           1.9-29.7          628-756         97-135          31-62
/p/    [illegible]-2383  [illegible]-894   3.4-110.2         609-795         82-120          29-110
/d/    2144-2874         754-1648          1.1-110.0         721-952         108-261         23-70
/t/    [illegible]-3218  546-1544          10.4-[illegible]  698-1088        82-310          26-37

Female #1

/g/    [illegible]-2843  [illegible]       11.9-24.3         489-571         41-443          59-162
/k/    1263-3131         228-550           26.2-99.2         359-568         43-332          24-104
/b/    442-1867          [illegible]       3.4-20.2          945-730         65-207          18-58
/p/    1028-1911         [illegible]-1278  10.4-32.8         413-708         64-147          24-45
/d/    1410-3739         535-1597          1.0-36.1          493-549         58-245          41-103
/t/    1189-2708         744-1167          6.9-88.7          428-794         43-159          34-138


AVERAGE ALLOPHONE RATIO PERCENTAGES

[Table for P #1 through P #6 illegible.]


FRICATIVES

PHONE  CYCLE-FREQ   CF DISP     DURATION      TV(NORMALIZED)  MICROSTRUCTURE  ABSAMP
       P #1 (Hz)    P #2        P #3 (msec)   P #4            P #5            P #6
       min-max      min-max     min-max       min-max         min-max         min-max

Male #1

/f/    3293-5663    2177-3206   106.3-184.7   1186-1494       315-360         32-38
/v/    1258-3772    693-2564    28.7-62.2     672-1177        235-323         42-63
/θ/    2667-3286    1704-2017   135.6-186.4   1079-1249       324-408         27-32
/ð/    1517-2688    1053-2069   13.1-92.2     674-920         201-348         43-52
/ʃ/    3673-3849    1559-1627   190.2-237.1   1092-1137       169-194         168-217
/ʒ/    3112-3676    1163-1865   92.2-140.8    [illegible]     147-193         153-256
/s/    4561-6328    1121-1529   151.2-204.3   1303-1521       212-265         146-284
/z/    4738-4976    1253-1472   80.9-158.9    1240-1333       215-234         114-191

Male #2

/f/    1871-2646    927-1129    126.7-211.5   775-913         175-234         31-42
/v/    1964-2376    987-1325    41.7-75.1     769-914         186-256         28-44
/θ/    1607-2520    817-1298    72.6-182.0    826-1011        92-265          26-34
/ð/    1637-2112    565-1200    16.5-28.7     656-833         102-204         28-76
/ʃ/    2678-3158    864-1097    196.8-274.6   857-976         105-132         218-285
/ʒ/    2366-3167    723-1057    127.3-174.7   796-972         104-130         112-157
/s/    3735-4771    725-1186    199.5-294.3   1221-1302       174-223         74-110
/z/    4175-4674    689-1355    116.0-183.4   1170-1259       183-221         42-86

Female #1

/f/    1280-2562    789-1039    34.5-160.5    632-909         182-214         24-32
/v/    556-1118     260-472     9.4-20.0      329-373         38-71           42-77
/θ/    980-1390     [illegible]-847  42.7-46.4  550-561       146-235         20-59
/ð/    647-1371     651-815     40.5-72.9     364-470         89-134          38-268
/ʃ/    2820-3744    576-1526    223.1-262.7   893-1097        100-157         128-159
/ʒ/    1512-2905    1017-1135   70.7-118.4    528-926         124-160         82-104
/s/    2690-4564    1248-1984   90.4-235.2    973-1264        218-361         27-132
/z/    1740-4353    1406-2659   31.7-113.1    693-2178        [illegible]     37-83


AVERAGE ALLOPHONE RATIO PERCENTAGES

P #1      P #2      P #3      P #4      P #5      P #6

T  99%    N  95%    T  74%    W  97%    R  90%    T  73%
W 100     - 100     R  87     T 100     - 100     W  98
R 100     T 100     N  88     - 100     N 101     - 100
- 100     W 107     W  91     N 101     W 102     R 107
N 105     R 110     - 100     R 110     T 117     N 133


Chapter V - VIOLINS, BOU-BOU SHRIKES, AND OTHER APPLICATIONS  Page 97


MUSIC - VIOLINS

Of course, since time-domain analysis is generalizable to any waveform, we may also examine
non-speech signals, with the hope of addressing questions of characterization in general, as well as
specific issues re: transient phenomena and temporal measures. For example, in conjunction with
Schumacher [H3], we have briefly examined some violin and cello music recorded under recital
conditions and digitized at 10 kHz. Generally the waveforms of these instruments resemble the
vowel portions of human speech. "Pitch periods" are readily apparent throughout. However, the
time-domain analysis reveals that these pitch periods are not perfectly regular. The individual
cycles within a given pitch period may fluctuate slightly from those in successive pitch periods.
Rapid fluctuations of this type have been referred to as "jitter" in perception experiments [H9].
This jitter is evident in the accompanying LIP plot (Fig. 45) of the open G string (G3) of the
violin, which at about 1.53 sec is followed by the first finger position (A3), played on the G string,
with vibrato.
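
To make the notion of jitter concrete, the sketch below computes one common summary of
period-to-period irregularity: the mean absolute difference between successive pitch periods,
divided by the mean period. This particular formula is our assumption, chosen for illustration;
the report does not state which jitter statistic, if any, it computes. The period values are
invented.

```python
def relative_jitter(periods_ms):
    """Mean absolute difference of successive periods, relative to the mean period."""
    if len(periods_ms) < 2:
        raise ValueError("need at least two periods")
    # Absolute change from each period to the next.
    diffs = [abs(b - a) for a, b in zip(periods_ms, periods_ms[1:])]
    mean_period = sum(periods_ms) / len(periods_ms)
    return (sum(diffs) / len(diffs)) / mean_period


# Hypothetical pitch periods (msec) near the period of an open G string.
periods = [5.10, 5.12, 5.09, 5.11, 5.10]
print(f"relative jitter: {100.0 * relative_jitter(periods):.2f}%")
```

A perfectly regular waveform, such as a simple sum of sine waves, gives zero under this measure.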

Note first that the clear horizontal lines are not perfectly straight and regular. The jitter of
successive pitch periods is reflected in the log inverse period or cycle-frequency measure displayed.
Perception experiments by Pollack [P2] indicate that such jitter may be perceptible to human
listeners. In fact, it may well be that these irregularities are a distinguishing feature of instrumental
music. For example, we find this difference between true violin music and synthesized violin music
which is built up as a simple composite of sine waves or other regular waveforms. In the waveform
displayed above the LIP plot, vertical lines have been automatically drawn pitch synchronously.
Following each line is a number measuring the duration of the previous period. This measure is in
units of centiseconds; therefore a 0.50 centisec duration corresponds to a frequency of 200 Hz.
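
Because the period labels are in centiseconds (hundredths of a second), the corresponding
frequency in Hz is 100 divided by the period. A period of half a centisecond (5 msec) therefore
gives 200 Hz. The helper name below is ours and is only a sketch of that conversion.

```python
def period_cs_to_hz(period_cs: float) -> float:
    """Convert a period in centiseconds to a frequency in Hz."""
    if period_cs <= 0:
        raise ValueError("period must be positive")
    # One second is 100 centiseconds, so f = 100 / period.
    return 100.0 / period_cs


for period in (0.50, 0.51, 0.45):
    print(f"{period:.2f} centisec -> {period_cs_to_hz(period):.1f} Hz")
```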

A longer term fluctuation is that caused by the vibrato. The rapid changes of finger pressure
create both amplitude and cycle-frequency modulations. The A3 note commencing at about 1.53
sec is played with vibrato. Although not much of this note is shown here, the oscillations of
cycle-frequencies can be seen. The waveform amplitude modulations are more easily seen in the
expanded waveform shown in Fig. 46. The full expanded waveform shown here lasts 2 seconds,
and should be read from left to right, and from top to bottom. For any given line, the duration of
time between vertical lines is 40 centiseconds (.04 seconds). The note A3 commences at about
206 centiseconds. Here we see about 2 1/2 periods of amplitude modulation in the waveform
envelope for A3. The musician playing vibrato here typically incorporated 5 or 6 periods of
amplitude modulation per second. For comparison, observe the lack of such periods during the
playing of the open G string. The only major change in amplitude seen here is a gradual increase,
which reflects increased bowing pressure. The LIP plot may be directly related in time to the
expanded waveform if one equates the origin of the LIP plot with the 50 centisecond marker on the
expanded waveform and converts centiseconds to seconds, or vice versa. Therefore, the 1.0 second
marker on the LIP plot corresponds to the 150 centisecond marker on the expanded waveform.
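
The time alignment between the two displays amounts to a change of units plus a fixed offset. The
sketch below assumes, following the example in the text, that time zero on the LIP plot lines up
with the 50 centisecond mark of the expanded waveform; the constant and function names are ours.

```python
# Centisecond mark on the expanded waveform assumed to coincide with
# time zero on the LIP plot (taken from the worked example in the text).
WAVEFORM_ORIGIN_CS = 50.0


def lip_seconds_to_waveform_cs(t_sec: float) -> float:
    """Map a LIP-plot time (seconds) to an expanded-waveform marker (centiseconds)."""
    return 100.0 * t_sec + WAVEFORM_ORIGIN_CS


def waveform_cs_to_lip_seconds(t_cs: float) -> float:
    """Inverse mapping: waveform marker (centiseconds) to LIP-plot time (seconds)."""
    return (t_cs - WAVEFORM_ORIGIN_CS) / 100.0


print(lip_seconds_to_waveform_cs(1.0))    # the 1.0 second marker
print(waveform_cs_to_lip_seconds(150.0))  # the 150 centisecond marker
```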

Next, note the apparently sharp discontinuity in LIP pattern occurring at about .67 sec. This
change does not represent a change of note. Examination of the waveform above the LIP plot
shows that both before and after this discontinuity, the pitch period durations are essentially the
same. We know that the friction of bowing excites the inherent resonances of the violin. The
discontinuity seen here is probably a threshold response to the gradual change of bowing pressure,
which elicits a shift in the relative amplitudes of the cycle-frequency components within the pitch
period.

Finally, note in the LIP plot the large irregularities occurring at major transition regions.
These transition regions reflect 1) the beginning of bowing at about .23 sec and 2) the change in
note and bowing direction at about 1.53 sec. It is known that such transitional regions provide
important cues for musical instrument identification.

From an analysis point of view, it may be possible that perception experiments based on
digital manipulation of such transition regions could elucidate information-bearing components of
these. More generally, time-domain analysis may prove helpful in providing better descriptions of
individual musical instruments as well as clarification of certain musical phenomena; e.g. jitter.
Alternatively, the incorporation of some of these phenomena may allow a better synthesis of
certain musical instruments than is presently available.




ANIMAL VOCALIZATIONS - BOU-BOU SHRIKES

The bou-bou shrike (Laniarius aethiopicus), which is also known as the bell shrike or bellbird,
is native to East Africa. Its song, largely devoid of harmonic content, is almost sinusoidal
acoustically and sounds quite melodic to the ear. Surprisingly, not all the individual notes in the
song are produced by a single individual bird, but rather by two separate birds whose vocalizations
are highly synchronized. This fact is apparent, however, to an observer standing between the two
birds and hearing the notes emanating from separate directions. This kind of song, usually begun
by one bird and finished by the other, is called "antiphonal song", a form of the more general
duetting phenomena. The separate vocal contributions of the two birds may overlap in time,
completely, partially, or not at all. The repertoire of precise song patterns sung by a pair of birds
may be quite large, with some patterns identical to those of other local pairs of the same species,
and still other patterns unique to the specific bird pair itself. Generally this antiphonal song is
characteristic of a male-female pair mated for life, with a relatively permanent year-round territory
amid dense tropical foliage. It is thought [T2] that antiphonal singing operates as a mechanism for
confirming and maintaining pair contacts where visual cues are practically absent. Since males and
females have identical physical form and color characteristics, their song conveys the song and
individual-specific information necessary for identification.

Thorpe [T1] noted that for Laniarius erythrogaster, a "reaction time" could be defined as the
period of time between when the first bird, A, started singing, and when the second bird, B,
commenced singing. He found that this reaction time was quite consistent, even though the
duration of each of the bird's vocalizations could be quite variable. He reports that for a series of
8 duets by L.e., a reaction time of 144 msec, with a standard deviation of 12.6 msec (about 8.8% of
reaction time) was observed. He also notes observations of others on 1) a series of 7 duets, L.a.,
with a reaction time of 425 msec and a standard deviation of 4.9 msec (about 1% of reaction time)
and 2) a series of 6 duets by Cisticola chubbi, with a reaction time of 396 msec and a standard
deviation of 2.9 msec (= .73% of reaction time). Thorpe bases his measures on spectrographic
data and claims accuracy within 1.5 msec.
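
The consistency figures quoted for reaction times are standard deviations expressed as a
percentage of the mean. The sketch below computes that summary from a list of measured reaction
times. The sample values are invented, and the use of the sample standard deviation
(`statistics.stdev`) rather than the population form is an assumption, since the report does not
say which was used.

```python
from statistics import mean, stdev


def reaction_time_summary(times_ms):
    """Return (mean, sample standard deviation, deviation as percent of mean)."""
    if len(times_ms) < 2:
        raise ValueError("need at least two reaction times")
    m = mean(times_ms)
    s = stdev(times_ms)
    return m, s, 100.0 * s / m


# Hypothetical reaction times (msec) for a short series of duets.
times = [292.0, 294.0, 296.0, 298.0]
m, s, pct = reaction_time_summary(times)
print(f"mean {m:.1f} msec, std dev {s:.3f} msec, {pct:.2f}% of reaction time")

# The same ratio for a reported mean of 144 msec and deviation of 12.6 msec.
print(f"{100.0 * 12.6 / 144.0:.2f}% of reaction time")
```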


The Cornell Library of Natural Sounds has generously made available to us recordings of
various animal vocalizations. The examples discussed here include recordings of L.a. (recorded by
M. P. McChesney) and Melospiza melodia, the song sparrow (recorded by R. C. Stein and R. S.
Little, #RCS63-99j). On a series of 4 duets of L.a., we have observed a reaction time of 294.4
msec with a standard deviation of 2.108 msec (= .72% of reaction time). These measures are
based on our time-domain analysis, as previously described for human speech. Temporal
resolution at this sampling frequency of 20 kHz is 50 microseconds.
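
The temporal resolution quoted here is simply the sample period, the reciprocal of the sampling
frequency. The sketch below shows that relation and how an interval between two onsets would be
converted from sample indices to milliseconds. The onset indices are hypothetical and the function
names are ours.

```python
def sample_period_us(fs_hz: float) -> float:
    """Duration of one sample in microseconds."""
    return 1.0e6 / fs_hz


def interval_ms(first_onset_sample: int, second_onset_sample: int, fs_hz: float) -> float:
    """Time between two onsets, given as sample indices, in milliseconds."""
    return 1000.0 * (second_onset_sample - first_onset_sample) / fs_hz


FS_HZ = 20000.0
print(f"sample period: {sample_period_us(FS_HZ):.0f} microseconds")

# Hypothetical onsets of bird A and bird B, 5888 samples apart.
print(f"reaction time: {interval_ms(1200, 7088, FS_HZ):.1f} msec")
```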

Thorpe notes that reaction times can be expected to vary as a function of 1) the differences
between the individual pairs singing any given song pattern, 2) distance between the duetting
birds, and 3) the current activity of the responding bird. The following illustrations show
comparatively the spectrographic and LIP displays for the same vocalizations. First is an example of a
vocalization by L.a. (reaction time = 185.0 msec), portrayed by 1) a wide-band spectrogram
(Fig. 47), 2) a narrow-band spectrogram (Fig. 48), and 3) an LIP plot (Fig. 49). Temporal
resolution for each of these is 1) 3.3 msec, 2) 22.2 msec, and 3) 50 microsec, respectively.
Frequency resolution on these graphical displays is 1) 300 Hz and 2) 45 Hz for the wide and
narrow band spectrograms, respectively; uninterpolated cycle-frequency resolution of the LIP plot
is inversely proportional to the cycle-frequency measured. At cycle-frequencies of 10 Hz, 100 Hz,
and 1000 Hz, the cycle-frequency resolution is .005 Hz, .5 Hz, and 50 Hz, respectively, given the
sampling rate of 20 kHz. Note that the time scale for the LIP plot is slightly reduced (3% less)
than for the spectrograms, and that the y-axis of the spectrograms is linear whereas the y-axis for
the LIP plot is logarithmic. The bou-bou shrike vocalization shown in the LIP plot corresponds to
the first of two shown in the spectrograms.
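
The cycle-frequency resolution figures can be checked with a short calculation. The sketch below
assumes that, without interpolation, a period is measured as a whole number of samples, so the
smallest change in measured cycle-frequency is the change produced by lengthening the period by one
sample. To first order this step is the square of the cycle-frequency divided by the sampling rate.
The function names are ours.

```python
def exact_step_hz(cycle_hz: float, fs_hz: float = 20000.0) -> float:
    """Frequency change when the measured period grows by one sample."""
    samples_per_cycle = fs_hz / cycle_hz
    return cycle_hz - fs_hz / (samples_per_cycle + 1.0)


def approx_step_hz(cycle_hz: float, fs_hz: float = 20000.0) -> float:
    """First-order approximation: f squared over the sampling rate."""
    return cycle_hz * cycle_hz / fs_hz


for f in (10.0, 100.0, 1000.0):
    print(f"{f:6.0f} Hz: exact {exact_step_hz(f):.4f} Hz, approx {approx_step_hz(f):.4f} Hz")
```

Under this assumption the step at a 20 kHz sampling rate is about 0.005 Hz at 10 Hz, 0.5 Hz at
100 Hz, and 50 Hz at 1000 Hz.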

Even in relatively noisy recordings of animal vocalizations; e.g. extraneous background bird
vocalizations, time-domain analysis appears rather noise resistant and details of the primary
vocalizations are usually still quite apparent. As revealed in the time-domain, bird A is singing
consistently at an average cycle-frequency of 1024 Hz (cycle-frequency dispersion = 12.85 Hz
which is 1.25% of the average cycle-frequency), whereas bird B is singing at an average
cycle-frequency of 828 Hz (cycle-frequency dispersion = 19.92 Hz which is 2.14% of the average
cycle-frequency). We have observed another bou-bou shrike which sang even more consistently,
a note of 1695 Hz (cycle-frequency dispersion = 13.26 Hz which is .78% of the average
cycle-frequency!). In summary, we find that both the temporal resolution and cycle-frequency
resolution inherent in our time-domain techniques are in fact required for sufficient accuracy in a
detailed characterization of the bou-bou shrike vocalization. We feel that this situation is in no
way unique, and that application of these techniques to other animal vocalizations may well
disclose many new features, patterns, and acoustic relationships in general. In addition, this
precise acoustic information (for example, the bou-bou shrike reaction time, or consistency of their
song cycle-frequencies) may well reflect the capabilities of physiological receptor and production
mechanisms, as they are naturally utilized.
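
As an illustration of how an average cycle-frequency and its dispersion might be obtained for a
nearly sinusoidal note, the sketch below measures one frequency per cycle from positive-going zero
crossings and then summarizes them. Three things here are assumptions rather than the report's
stated procedure: the use of zero crossings with linear interpolation, the treatment of
"dispersion" as a population standard deviation, and the synthetic 1024 Hz test tone.

```python
import math
from statistics import mean, pstdev


def cycle_frequencies(samples, fs_hz):
    """Estimate one frequency per cycle from positive-going zero crossings."""
    crossings = []
    for i in range(1, len(samples)):
        a, b = samples[i - 1], samples[i]
        if a < 0.0 <= b:
            # Linear interpolation of the crossing time, in units of samples.
            crossings.append((i - 1) + (-a) / (b - a))
    # Each pair of successive crossings spans one full cycle.
    return [fs_hz / (t1 - t0) for t0, t1 in zip(crossings, crossings[1:])]


fs = 20000.0
# Synthetic steady tone; the phase offset avoids a sample landing exactly on zero.
tone = [math.sin(2.0 * math.pi * 1024.0 * n / fs + 0.3) for n in range(2000)]

freqs = cycle_frequencies(tone, fs)
average = mean(freqs)
dispersion = pstdev(freqs)
print(f"average {average:.1f} Hz, dispersion {dispersion:.2f} Hz "
      f"({100.0 * dispersion / average:.2f}% of average)")
```

A real vocalization would show a larger dispersion than this synthetic tone, which varies only
through interpolation error.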

Finally we leave an exercise for the reader! The same kinds of comparative visual displays
(two pages each) follow, in the same order, for the song of an individual song sparrow
(Figs. 50-52). This species is a familiar song bird and has a complex song characterized by both
transients and steady states. Here, with higher frequencies present than in the case of the bou-bou
shrike, more compensation must be made for the difference between the linear and logarithmic
scales. In comparing the different displays, note differences in details, many of which occur in
consistent patterns, as well as differences in the gross features.


[Figures: BOU-BOU SHRIKE (Laniarius aethiopicus), wide-band spectrogram, narrow-band spectrogram,
and LIP plot]

[Figures: SONG SPARROW (Melospiza melodia), wide-band spectrogram, narrow-band spectrogram, and
LIP plot; horizontal axis: Time (sec)]



CONCLUDING  COMMENTS 

As we have previously stated, the time-domain techniques presented here are, of course,
generalizable to any waveform. In light of both the theoretical reasons and empirical results
previously discussed, we consider these time-domain techniques most suited to
studying waveforms where quickly changing or transient phenomena and/or precise temporal
measures are of primary interest.

Despite the historical prevalence and, in fact, nearly exclusive use of frequency-domain
analyses for signal waveforms, we have attempted to demonstrate that certain kinds of useful and
heretofore untapped information are uniquely available in the time-domain. We have chosen
human speech as the chief vehicle for this demonstration, largely because we know much more
about the information borne by these complex waveforms than we generally know about most
others. As a means for exploring these techniques, we have addressed in detail certain challenging
problems in the areas of speech characterization, segmentation, and allophone differentiation. In
addition, we have briefly examined other kinds of acoustic waveforms amenable to such analysis
methods. It appears to us that a great deal more information in the time-domain remains to be
explored, and potentially, quite profitably so.

We strongly recommend, however, that frequency-domain analyses and time-domain analyses
be used in a complementary fashion. Despite their redundancy, each of these domains best
conveys different and essential characteristics of complex waveforms. We know there are a
number of biological mechanisms of the sensory modalities which are capable of responding
to very short term or transient phenomena, as well as others which respond to long term
stimuli. In addition, we know that a complex combination of steady states and transients is found
in many information-bearing waveforms: human speech, other animal vocalizations, music,
biomedical measures such as the EEG and EKG, recordings of physical state changes in inorganic
materials, geological seismic recordings, various means of high-frequency communications, and
many more.

We feel that the use and understanding of both frequency-domain techniques and time-domain
techniques, in conjunction with each other, will lead to better analysis and characterization
of complex waveforms, and probably even to improved synthesis of such waveforms as well.


APPENDIX A


[Figures for the five test utterances; plotted axis: Log Cycle Frequency (Hz)]

B35: DO ANY SAMPLES CONTAIN TRIDYMITE

D7: COUNT WHERE TYPE EQUALS LINEAR EQUATIONS AND RUNTIME LESS THAN FIVE SIX

LS1: I WANT TO DO PHONEMIC LABELLING ON SENTENCE SIX

LM13: DISPLAY THE PHONEMIC LABELS ABOVE THE SPECTROGRAM

DO YOU HAVE ANY RECTANGULAR CYLINDERS LEFT



APPENDIX B


Sentence Segmentation Results


UTTERANCE #1


1) "Do any samples contain tridymite?"
total # primary boundaries = 33


Absolute Deviation (msec) of Primary Boundaries by Phone Class

PROGRAMS  PAUSES          STOPS           VOWELS          LIQUIDS   FRICATIVES  NASALS    VOICING

TDS       9.2(4)          [illegible]     3.3(7)          2.1(1)    1.8(2)      1.3(6)    2.3(3)
B         -(0)            [illegible](5)  [illegible](7)  3.3(2)    7.4(2)      19.0(4)   17.5(2)
C         4.4(2)          [illegible](4)  14.1(8)         7.3(1)    0.8(2)      7.3(4)    -(0)
D         [illegible](2)  [illegible](4)  [illegible](9)  0.9(1)    5.7(2)      5.3(5)    3.0(2)
E         [illegible](3)  1.5(3)          12.0(8)         9.4(2)    11.5(2)     5.4(4)    5.9(3)


PRIMARY END BOUNDARIES

PROGRAMS  STOPS        FRICATIVES

TDS       [illegible]  0.1(2)
B         [illegible]  17.2(1)
C         12.3(5)      8.5(2)
D         5.1(4)       [illegible](2)
E         12.3(6)      [illegible](2)


PROGRAMS  MISSED BOUNDARIES  EXTRA BOUNDARIES  MISSED PLUS EXTRAS  # SECONDARY BOUNDARIES

TDS       [illegible]        4                 7                   6
B         11                 5                 16                  2
C         12                 3                 15                  1
D         [illegible]        [illegible]       13                  2
E         8                  4                 12                  1

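
The quantities tabulated in this appendix (absolute deviation of matched boundaries, missed
boundaries, extra boundaries, and their sum) can be illustrated with a small scoring routine. The
sketch below is not the report's procedure: the greedy nearest-match rule, the 30 msec tolerance,
and all names and sample values are assumptions made for illustration.

```python
def score_boundaries(reference_ms, detected_ms, tolerance_ms=30.0):
    """Match detected boundaries to reference boundaries, one-to-one.

    Each reference boundary takes the closest unused detected boundary
    within the tolerance. Unmatched references count as missed and
    unmatched detections count as extra.
    """
    unused = sorted(detected_ms)
    deviations = []
    missed = 0
    for ref in sorted(reference_ms):
        candidates = [d for d in unused if abs(d - ref) <= tolerance_ms]
        if candidates:
            best = min(candidates, key=lambda d: abs(d - ref))
            unused.remove(best)
            deviations.append(abs(best - ref))
        else:
            missed += 1
    extra = len(unused)
    mean_dev = sum(deviations) / len(deviations) if deviations else None
    return {
        "mean_abs_deviation_ms": mean_dev,
        "matched": len(deviations),
        "missed": missed,
        "extra": extra,
        "missed_plus_extras": missed + extra,
    }


# Hypothetical hand-marked and machine-found boundary times (msec).
reference = [100, 250, 400]
detected = [103, 260, 600, 610]
result = score_boundaries(reference, detected)
print(result)
```

In each legible row of the tables, the missed-plus-extras column is the sum of the missed and
extra columns, which is the relation the last dictionary entry reproduces.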


UTTERANCE #2


2) "Count where type equals linear equations and runtime less than five six."
total # primary boundaries = 64


Absolute Deviation (msec) of Primary Boundaries by Phone Class

PROGRAMS  PAUSES    STOPS           VOWELS    LIQUIDS   FRICATIVES      NASALS    VOICING

TDS       0.3(10)   1.2(8)          11.2(13)  4.3(8)    4.4(7)          1.4(7)    0.9(7)
B         14.6(6)   10.3(8)         12.1(16)  9.2(8)    14.5(7)         12.3(7)   18.0(6)
C         8.6(3)    6.6(2)          7.8(13)   7.2(5)    [illegible](8)  11.4(6)   6.9(8)
D         6.0(8)    [illegible](8)  6.3(15)   5.7(7)    13.6(7)         6.0(6)    12.0(8)
E         10.5(5)   16.1(5)         10.1(15)  11.1(8)   13.8(5)         10.9(6)   4.1(5)


PRIMARY END BOUNDARIES

PROGRAMS  STOPS     FRICATIVES

TDS       4.4(6)    8.6(6)
B         12.5(6)   17.8(7)
C         5.4(7)    5.2(7)
D         4.7(7)    5.9(7)
E         6.4(5)    10.6(7)


PROGRAMS  MISSED BOUNDARIES  EXTRA BOUNDARIES  MISSED PLUS EXTRAS  # SECONDARY BOUNDARIES

TDS       4                  8                 12                  11
B         6                  12                18                  [illegible]
C         19                 0                 19                  [illegible]
D         5                  18                23                  [illegible]
E         15                 7                 22                  [illegible]



UTTERANCE #3


3) "I want to do phonemic labeling on sentence six."
total # primary boundaries = 38


Absolute Deviation (msec) of Primary Boundaries by Phone Class

PROGRAMS   PAUSES     STOPS      VOWELS     LIQUIDS    FRICATIVES   NASALS     VOICING

TDS        1.3(3)     0.1(5)     3.9(10)    22.8(1)    7.0(4)       1.1(6)     4.5(4)
B          - (0)      10.0(4)    10.6(11)   8.?(3)     5.3(4)       6.3(5)     2.3(3)
C          3.2(2)     13.1(3)    10.8(11)   11.6(2)    11.8(4)      6.4(6)     2.8(4)
D*         5.7(1)     6.2(1)     14.4(6)    33.0(1)    - (0)        12.2(3)    7.1(1)
E          8.0(3)     6.2(3)     15.3(8)    7.4(2)     25.2(4)      13.7(4)    0.8(3)


PRIMARY END BOUNDARIES

PROGRAMS   STOPS      FRICATIVES

TDS        9.8(4)     2.8(3)
B          5.2(5)     2.3(3)
C          7.8(6)     12.8(1)
D*         20.6(1)    12.3(1)
E          5.8(3)     18.9(3)


PROGRAMS   MISSED BOUNDARIES   EXTRA BOUNDARIES   MISSED PLUS EXTRAS   # SECONDARY BOUNDARIES

TDS        5                   6                  11
B          8                   10                 18
C          6                   1                  6
D*         8                   ?                  16
E          11                  10                 21

* - only about half of utterance segmented.




UTTERANCE  #4 


4) "Display the phonemic labels above the spectrogram."
total # primary boundaries = 47


Absolute Deviation (msec) of Primary Boundaries by Phone Class

PROGRAMS   PAUSES     STOPS      VOWELS     LIQUIDS    FRICATIVES   NASALS     VOICING

TDS        0.8(5)     0.1(9)     3.1(10)    9.0(3)     4.9(6)       0.4(3)     1.1(6)
B          6.4(2)     4.1(5)     8.8(11)    5.7(4)     12.4(5)      7.7(3)     8.0(7)
C          20.0(1)    1.6(2)     13.0(6)    - (0)      27.2(3)      9.9(1)     4.5(1)
D*
E          22.8(3)    14.8(7)    10.4(10)   13.6(3)    18.3(5)      15.9(2)    10.1(5)


PRIMARY END BOUNDARIES

PROGRAMS   STOPS      FRICATIVES

TDS        4.9(5)     2.8(7)
B          9.9(6)     8.3(6)
C          - (0)      18.7(3)
D*
E          17.1(7)    16.0(7)


PROGRAMS   MISSED BOUNDARIES   EXTRA BOUNDARIES   MISSED PLUS EXTRAS   # SECONDARY BOUNDARIES

TDS        5                   7                  12                   11
B          10                  6                  16                   0
C          33                  1                  34                   0
D*
E          12                  2                  14                   0

* - utterance segmentation results not available




UTTERANCE #5


5) "Do you have any rectangular cylinders left?"
total # primary boundaries = 3?


Absolute Deviation (msec) of Primary Boundaries by Phone Class

PROGRAMS   PAUSES     STOPS      VOWELS     LIQUIDS    FRICATIVES   NASALS     VOICING

TDS        5.0(2)     3.0(6)     5.8(9)     7.9(5)     5.4(4)       4.5(3)     2.9(2)
B          1?.?(5)    7.7(4)     12.?(10)   5.7(4)     8.9(3)       13.3(3)    - (0)
C          ?.?(1)     18.1(2)    1?.0(5)    - (0)      16.1(4)      2.5(1)     13.1(1)
D          10.8(3)    7.3(5)     11.4(?)    13.9(5)    11.4(4)      10.9(3)    11.3(1)
E          5.5(4)     6.4(2)     18.8(8)    17.5(5)    12.0(3)      24.0(1)    3.5(1)


PROGRAMS   MISSED BOUNDARIES   EXTRA BOUNDARIES   MISSED PLUS EXTRAS   # SECONDARY BOUNDARIES

TDS        3                   13                 16                   10
B          5                   15                 20                   3
C          20                  5                  25                   0
D          1                   19                 21                   0
E          10                  ?                  19                   0




APPENDIX C


THE 6 PARAMETERS OF REFERENT ALLOPHONES FOR PAIR-WISE PHONE RECOGNITION STUDY


STOP CONSONANTS:

           CYCLE-FREQ   CYCLE-FREQ DISP   DURATION      TV (NORMALIZED)   MICROSTRUCTURE   ABSAMP
PHONE      P#1 (Hz)     P#2               P#3 (msec)    P#4               P#5              P#6

MALE #1

/g/        1672         720               82.9          614               117              97
/k/        1455         873               70.3          616               168              49
/b/        2296         463               6.0           774               115              ?6
/p/        2485         322               33.4          787               76               77
/d/        2770         1084              35.1          997               195              94
/t/        3404         1523              101.1         991               169              118

MALE #2

/g/        1436         428               35.5          468               88               88
/k/        1577         469               154.6         535               57               214
/b/        1562         478               15.7          686               135              31
/p/        2046         544               110.2         707               82               110
/d/        2144         754               60.0          721               108              70
/t/        3218         1544              59.?          1088              246              27

FEMALE #1

/g/        1872         355               22.5          509               143              112
/k/        1551         260               81.8          489               101              82
/b/        1478         454               1?.3          559               81               44
/p/        1911         935               32.8          666               114              37
/d/        2201         551               29.3          563               183              73
/t/        2708         1083              88.7          604               139              48

STOP CONSONANTS


ALLOPHONE RATIOS
MALE #1


CYCLE-FREQ   CYCLE-FREQ DISP   DURATION

A l.O^b 

1 .H89 

0.<408 

0.866 

0.378 

0 ^ .'o 

‘ 1.123 

„ a.6iH 
1:  0.759 

1 .7fii^ 

< .27  6 

0 6H5 

V* . 558 

G .6  V 

a . 1 1 

' 1.342 

2.091 

0.411 

” 0.935 

1 . 69  .'4 

(■.397 

TV (NORMALIZED)   MICROSTRUCTURE   ABSAMP


1.60/ 

1.029 

0.973 

0.824 

0.7  29 

(4.376 

1 . Ob  '4 

1 . vj  ’J'  ) 

(I.  940 

0.  BOJ 

0 . Et‘..  3 

0.770 

(/.BOO 

0 . 7 (4  6 

o.ezj 

1.331 

2.  130 

0.  1G2 

1.038 

1.597 

1.  129 

STANDARD  DEVIATIONS 


N 

n 

T 

W 


P#1(Hz)

P#2

P#3(msec)

1 . v4 1 > 4 

0 .4  20 

0 . 1 U. 

U . (io3 

0 . u 37 

0 . 105 

ij  A)-j  t 

(.  .or-c 

{' . 0 .ij 

0 ,o:-7 

0.114 

0 . l.'l 

q . Of  - w 

0 . 1 U (: 

0.04  J 

u.U.  1 

0 . (4  ^ 4 

(*.2v.O 

C .1/3  6 

1.25n 

0 .0  ly 

P#4

P#5

P#6

0.312 

14.32  5 

0.072 

O.OUG 

O.Cl.- 

(4.0  24 

0 . 0 3 6 

(/.0(/2 

0.115 

O.OtO 

0.  1/.7 

0.092 

0 . 03(y 

0,025 

0.054 

0 . 0 j6 

1.231 

U.0(4l 

O.OCu 

0 . o 5 0. 

0.902 



ALLOPHONE  RATIOS 
MALE  #2 


P#2

P#3(msec)

P#4

P#5

P#6

1.696 

1.387 

0.717 

1.666 

1.383 

0.654 

1.337 

1 .435 

0.076 

1. 132 

1.  16  5 

1.075 

1.06B 

1 . J6L 

G . 960 

1.032 

0.957 

1.252 

0.703 

(1 .65'^ 

0.434 

0.771 

0 .056 

0.753 

0.766 

0.49  1 

1 .00  3 

0.763 

0.378 

1.561 

1 .065 

1.558 

0 .54  7 

1.120 

1.838 

0.6  16 

0 .i^50 

1.105 

0.497 

0.887 

0.975 

1.265' 

STANDARD  DEVIATIONS 


P#1

P#2

P#3

P#4

P#5

P#6

C.003 

0.007 

1.461 

0.  100 

0.956 
0.00  4 

0.023 

0.OO2 

0.()00 

0.398 

0.^10 

0.028 

6.026 

6.307 

0 .4  2<) 

0.020 

t< . 188 

0.759 

0.0  17 

f'.  16P 

0.  127 

0.006 

6.223 

0.171 

C.024 

C.074 

0.87  7 

0.012 

6.016 

2.086 

0.151 

0.7  69 

0.411 

0.081 

6.0G9 

0.201 

0.u23 

0.324 

0.274 

0.019 

0.226 

1.195 

ALLOPHONE RATIOS
FEMALE #1


P#2

P#3(msec)

P#4

P#5

P#6

1,7  i- 

1 

I.Ool 

3.  Ip 3 

1.061 

C U,b7  7 

V,h'J7 

u.  jvu 

t>.  77^ 

u.iise 

0.471 

M 

' < /"J  1 « / 5 

I . oL- 

1.057 

1 1 

1 . 104 

U.'-J  U/ 

•>  . 3i*  1 

G - 112  *7 

t . 

1 . 2 4 1? 

* * • ^ 

1 . 1 G j 

/s,' 

0.  ‘/bV 

'll . 66'*j 

0 , «5  1 

1 . 

kI  . ho  t> 

1.09'j 

j ^ fi  1 

0 . 75  1 

' . J N 

> . 6GO 

O.P  T/ 

U . GO  Z 

1.251 

STANDARD DEVIATIONS


P#1

P#2

P#3

'.  125 

. ' . t:  0 / 

..  .Or  7 

C 

0 . . 1 ' ‘ ) 

U . V.'  0 1 

0 . 1 0 / 

- 

u . ! . 

.65  1 

' 

t. . : /■ : 

' . i’.-t 

' .Vi  / 

R 

W r U V • ‘C> 

■ . 37  0 

1 2 *4 

T 

U • V 1 \ ' 

1 .i 

0 . 0 u 7 

. . ■:■  / 

1 ' . 3 

P#4

P#5

P#6

U . ( ( / 

t .t16 

b . 1 ’K' 

0.0^3 

0.373 

U .063 

0.0  26 

0.5  6-4 

0.143 

0.035 

0.04  4 

0 . PZ6 

0 .(M'-Z- 

0 .OP 7 

0.022 

0.0P6 

u . 1 3 0 

u . 004 

0 . 027 

0.  10  3 

0 . 4J2 



THE 6 PARAMETERS OF REFERENT ALLOPHONES FOR PAIR-WISE PHONE RECOGNITION STUDY


FRICATIVES:

           CYCLE-FREQ   CYCLE-FREQ DISP   DURATION      TV (NORMALIZED)   MICROSTRUCTURE   ABSAMP
PHONE      P#1 (Hz)     P#2               P#3 (msec)    P#4               P#5              P#6

MALE #1

/f/        4341         2237              184.7         1359              315              38
/v/        3667         2409              45.0          1166              313              63
/θ/        2667         1704              182.0         1079              324              32
/ð/        2035         2069              92.2          920               348              43
/ʃ/        3845         1559              191.2         1111              169              202
/ʒ/        3585         1307              125.7         1081              176              256
/s/        5198         1189              181.4         1372              223              284
/z/        4914         1253              158.9         1333              234              191

MALE #2

/f/        2646         1113              164.8         913               175              42
/v/        1964         1079              75.1          769               186              28
/θ/        2520         1298              182.0         1011              265              34
/ð/        17?0         565               16.5          656               110              76
/ʃ/        3015         864               274.6         938               125              285
/ʒ/        3167         781               127.3         972               130              125
/s/        3735         827               294.3         1221              174              98
/z/        4378         1355              ?0.4          1236              221              56

FEMALE #1

/f/        2562         1020              139.7         909               199              32
/v/        1118         411               9.4           373               38               42
/θ/        1390         847               46.4          561               146              59
/ð/        6?1          651               72.9          364               128              30
/ʃ/        3744         576               262.7         1097              140              159
/ʒ/        2905         1135              118.4         926               160              82
/s/        2690         1411              191.6         973               253              40
/z/        2778         2096              54.6          840               234              83

FRICATIVES


ALLOPHONE RATIOS


MALE #1


P#1(Hz)

P#2

P#3(msec)

P#4

P#5

P#6

1 . j39 

1 .121 

0.965 

1.002 

1.025 

0.699 

w 

0.917 

0.90b 

0.784 

0.958 

0.!/5.'» 

j . 640 

R 

N 

0.91 5 
1 .005 

0.9.T9 

0.063 

0.669 

0.6«f0 

0.929 

0.95.1 

1.021 

0.960 

0 . 052 
0.557 

STANDARD  DEVIATIONS 


P#1 

P#2 

P#3 

P#4

P#5 

P#6 

0 .02-^ 
O.U03 
0.000 
0.133 

0.086 
0.0  24 
0.025 
0 .107 

0.115 
O.G34 
0 . 009 
(1.002 

0 . 0 1 5 
O.OUl 
0.022 
o.oua 

0.(<4(/ 
0.013 
0.03  1 
0.07.L 

U.((42 
0.010 
0 .t.3t> 
0.02b 

MALE #2


ALLOPHONE RATIOS


P#1(Hz)  P#2


VI  C . 9fl  1 
R 1.00'/ 

H 

j 1.01'/ 


1.  108 
0.9  18 
0 .929 
1 .or. 


P#3(msec)  P#4

0.096  0.906 

O.eoS  U.966 

0.8T1  0.987 

0.95  2 I.OviO 


STANDARD  DEVIATIONS 


P#5  P#6

I.OttS  0.96'i 

0.905  1.051 

0. 923  0.969 

1. Zu2  0.743 


P#1

P#2

P#3

P#4

P#5

P#6

W 

(;  .036 

0.236 

0 .091' 

0.021 

(/.  131 

0.  no 

R 

(j  ,027 

0.016 

0 .05/ 

0.C03 

0.015 

0.073 

M 

0.031 

0.072' 

U.  16* 

0.007 

O.074 

0.090 

T 

0.003 

0.  140 

0.05'j 

O.OOG 

0 . 0 5 0 

0 .000 

FEMALE #1

ALLOPHONE RATIOS

P#1(Hz)

P#2

P#3(msec)

P#4

P#5

P#6

U 

0 .96  4- 

0.949 

i< . 63  b 

0.926 

0.937 

1.059 

R 

1.063 

1.433 

1 .025 

1.362 

0 . 029 

1.508 

H 

1.2b  If 

0.915 

1.  141 

1.  121 

1.004 

2.191 

T 

0.939 

1.124 

(;.622 

1 .013 

1.141 

0.5*52 

STANDARD DEVIATIONS

P#1

P#2

P#3

P#4

P#5

P#6

W 

•0.260 

0 . 0 2b 

0.085 

0 .OHO 

C.(.27 

O.04Z 

R 

0.063 

0 . 66  0 

0.061 

0.(,75 

0.009 

1.325 

N 

0.370 

U.0  99 

O.ftib 

0.0  57 

0.  163 

6.S31 

T 

O.O60 

0.071 

O.O44 

0.030 

0.059 

0.064 

STOP CONSONANTS


MALE #1

G K U I’  I)  T 
c;i5  1 1 I J 

K 2 0 1 I 1 

H 9 

1>  1 i 

I)  1 1 1 1 19  3 

r 2 i 1 2 1 17 


PAIR-WISE PHONE COMPARISON TEST RESULTS

MALE #2

FEMALE #1

c. 

K B P D 

T 0 

K H P 

D 

T 

tiWi 

1 

CIS 

1 

1 

3 

K 

15 

K 2 

9 1 

1 

2 

B 

10  1 1 

1 B 1 

9 

2 

1 

P 

2 1 1 

3 P 

1 

7 

1 

I) 

GO 

1 1) 

2 

2 

10 

2 

T 

2 

13  T 

1 

2 

2 

7 

FRICATIVES 


PAIR-WISE PHONE COMPARISON TEST RESULTS


MALE #1

MALE #2

FEMALE

F 

V c ( 

» 

S /. 

I- 

V 

V. 

> 

1 : 

S Z 

F V 

. ' ’ ■) 

/ 

s 

Z 

F21 

3 

F13 

1 

2 

1 

FI  9 

3 

1 

V 

17 

V 

13 

2 

1 

V 1 

C 

2 18 

1 

10 

1 

2 1 

6 2 

-> 

2 20 

7 

1 

8 

J 

2 1 

2 

1 

21  1 

1 

1 

15 

5 

1 

1? 

1 

13 

1 

1 

1 

1 3 

1 

s 

20 

S 

26  2 

S 1 

1 

1 

28 

o 

L 

2 2 1 

z 

1 

2A 

/,  1 

1 

1 



BIBLIOGRAPHY 


[A1]  Atal, B.S. and S.L. Hanauer, Speech analysis and synthesis by linear prediction of the speech
wave, JASA 55: 637-655, 1974.

[B1]  Baker, J.M., J.K. Baker, and J.Y. Lettvin, More visible speech, JASA 52: 183 (A), 1972.

[B2]  Baker, J.M., R. Ramsey, M. Miller, J.K. Baker, and C. Cooper, Comparative visual displays
of time and frequency domain information in connected speech, JASA 55 (no. 2): (A), 1974.

[B3]  Baker, J.M. and R.T. Schumacher, Computer study of "jitter" in violin and cello tones, paper
presented at "A Topical Conference on the Teaching of Acoustics and the Physics of Sound and
Music", April 5-6, 1974, Univ. of Iowa, Iowa City, Iowa.

[B4]  Baker, J.M., A new time-domain analysis of fricatives and stop consonants, Proc. of IEEE
Symposium on Speech Recognition, Pittsburgh, Pa., 1974.

[B5]  ________, Time-domain acoustic characteristics of allophones and phonological phenomena, JASA
55: (A), Supplement, Spring, 1974.

[B6]  ________, Time-domain analysis and segmentation of connected speech, Proc. of the Speech
Communication Seminar, Stockholm, 1974.

[B7]  ________, Automatic time-domain techniques for segmentation of connected speech, JASA 56: (A),
Supplement, Fall, 1974.

[B8]  ________, New time-domain analysis for complex animal vocalizations, JASA 56: (A), Supplement,
Fall, 1974.

[B9]  Boomsliter, P.C. and W. Creel, Research potentials in auditory characteristics of violin tone,
JASA 51: 1984, 1972.

[C1]  Chandra, S., Experimental comparison between stationary and nonstationary formulations of
linear prediction applied to voiced speech analysis, IEEE Trans. Acoust., Speech, Signal
Processing, ASSP-22 (no. 6): 403-415, 1974.

[C2]  Chang, S.H., G.E. Pihl, and J. Wiren, The intervalgram as a visual representation of speech
sounds, JASA 23 (no. 6): 675-679, 1951.

[C3]  Chang, S.H., G. Pihl, and M.W. Essigmann, Representation of speech sounds and some of
their statistical properties, Proc. I.R.E. 39: 147-153, 1951.

[C4]  Chang, S.H., Two schemes of speech compression system, JASA 28: 565-572, 1956.

[D1]  Davis, H., Peripheral coding of auditory information, in Sensory Communication (W.
Rosenblith, Ed.), MIT Press, 119-141, 1961.

[D2]  Davis, K.H., R. Biddulph, and S. Balashek, Automatic recognition of spoken digits, JASA 32:
1450-1455, 1960.

[F1]  Frishkopf, L. and M. Goldstein, Responses to acoustic stimuli from the eighth nerve of the
bullfrog, JASA 35: 1219-1228, 1963.

[G1]  Galambos, R. and H. Davis, The response of single auditory nerve fibers to acoustic
stimulation, J. Neurophysiol. 69: 58, 1943.

[G2]  Gerstman, L.J., Classification of self-normalized vowels, IEEE Trans. Audio Electroacoust.,
AU-16 (no. 1): 78-80, 1968.

[I1]  Ito, M.R., Investigation of time domain measurements for analysis of speech, Ph.D. Thesis,
Univ. of British Columbia, 1971.

[K1]  Kiang, N.Y.-S., Discharge Patterns of Single Fibers in the Cat's
Auditory Nerve, MIT Research Monograph, No. 35, 1965.

[K2]  ________, and E.C. Moxon, Tails of tuning curves of auditory nerve fibers, presented at the 85th
meeting of the Acoustical Society of America, April 1, 1973.

[K3]  Konishi, M., Time resolution by single auditory neurones in birds, Nature 222 (no. 5193):
566-567.

[K4]  ________, Comparative neurophysiological studies of hearing and vocalizations in songbirds,
Z. vergl. Physiologie 66: 257-272.

[L1]  Licklider, J.C.R. and I. Pollock, Effects of differentiation, integration, and infinite peak
clipping upon the intelligibility of speech, JASA 20: 42-51, 1948.

[L2]  Licklider, J.C.R., The intelligibility of amplitude-dichotomized time-quantized speech waves,
JASA 22: 820-823, 1950.

[M1]  Makhoul, J.I. and J.J. Wolf, Linear prediction and the spectral analysis of speech, Bolt,
Beranek, and Newman, Inc., Cambridge, Mass., Report #2304, 1972.

[M2]  Markel, J.D., The Prony method and its applications to speech analysis, JASA 49: 105 (A),
1971.

[M3]  Morris, L.R., The role of zero crossings in speech recognition and processing, Ph.D. Thesis,
Univ. of London, 1970.

[M4]  Munson, W.A., and H.C. Montgomery, A speech analyzer and synthesizer, JASA 22: 678
(A), 1950.

[P1]  Peterson, E., Frequency detection and speech formants, JASA 23: 668-674, 1951.

[P2]  Pollock, I., Detection and relative discrimination of auditory "jitter", JASA 43: 308-315,
1968.

[R1]  Reddy, D.R., Segmentation of speech sounds, JASA 40 (no. 2): 307, 1966.

[R2]  Rose, J.E., J. Brugge, D. Anderson, and J. Hind, Phase-locked responses to low-frequency
tones in single auditory nerve fibers of the squirrel monkey, J. Neurophysiol. 30 (no. 4): 767-793,
1967.

[S1]  Sakai, T. and S. Inoue, An analyzing equipment for the zero crossing interval and its
applications to speech analyses, J. Inst. Elect. Commun. Engrs. Japan 39: 404-409, 1956 (in Japanese),
English abstract in Phys. Abst. 60: 110-11 57, 1957.

[S2]  Schatz, C.D., The role of context in the perception of stops, Language 5: 47-56, 1954.

[S3]  Shoup, J., The phonemic interpretation of acoustic phonetic data, Ph.D. Thesis, Univ. of
Michigan, 1964.

[S4]  Stevens, K.N. and A. House, Perturbation of vowel articulation by consonant context; an
acoustic study, J. of Speech and Hearing Res. 6: 111-128, 1963.




«oi!jTsA  K rno^al:  °'  '"'  Ji«i"«ion  fo, 

NalLImThlm.ISS'™"'''  *'"*'"*  '"  "“‘io" 

Thorpe, W.H. and M.E.W. North, Origin and significance of the power of vocal imitation:
with special reference to the antiphonal singing of birds, Nature 208: 219-222, 1965.


